ABSTRACT 


The work of manually creating a biographical summary from multiple 
information sources is both time-intensive and detail-oriented. Automating the task is 
also non-trivial because of the many NLP areas that must be used to efficiently extract 
the relevant facts. Yet, no study has been done to determine how powerful a biographical 
summarization system must be in order to achieve the basic goal of filling slots in a 
biography template. Equally important, the simplest approaches to discovering and 
extracting biographical information from text have not been implemented. Further, no 
standard evaluations have been developed for summarization in general, but an 


evaluation methodology for this research is described and performed. 
I. INTRODUCTION 


A. BACKGROUND 


While this age may be referred to as the Information Age, it will be left to history 
to determine how much of the information currently circulated is useful, relevant, and 
important. The advent of technologies like blogs, MySpace, forums, multimedia sharing 
sites, and digital outlets for traditional news sources has created an “info-culture” in 
which every person has the opportunity to be an author with a potential world-wide 
audience. The challenges to information management that this type of culture presents 
are apparent. New ways have been and must continually be developed to index and 
search the vast amount of content available. Additionally, methods of information 
extraction are needed in the U.S. intelligence community where not just speed, but speed 


with accuracy is required to return a small number of highly relevant results. 


Many researchers wish to see the process of content selection entirely automated 
based on a user’s needs, but the nature of search dictates that humans will continue to be 
involved. Human involvement dictates that the amount of information presented be 
reasonably small so that the time to review it and make decisions about how to proceed is 
small as well. Again, the notion that it is imperative to minimize the time required to 
review information is a given to entities such as intelligence agencies. Automatic 
Document Summarization (ADS) is one area of research that has historically focused on 
techniques for autonomously creating brief abstracts of larger bodies of text. Variations 
of the technique have surfaced over time to deal with things like creating summaries of 
multiple documents (Multi-Document Summarization or MDS) and generating summaries 


of non-textual information, such as photographs and other multimedia. 


When focused on one genre, online news articles, multiple document 
summarization techniques can be adapted to find, catalog, and summarize specific types 
of information. This research attempted to locate and aggregate information about 
entities and create a biographical summary. To be effective, such a summary would need 


to be a type of dossier, citing an entity’s given name, aliases, title, country, etc. The most 


naive approach to creating such a summary is through the use of arbitrary keyword 
selectors. For instance, an attempt to find information about an entity’s birthplace may 
include searching the articles for keywords such as “born,” or “hometown.” This 
approach quickly snowballs toward a desire to understand the sentence because it cues up 


questions such as “born to whom?” and “born where?” 


Sometimes, summarizing one document may be enough to achieve this goal. 
Usually, though, it will be necessary to summarize a collection of documents. As stated 
above, MDS can be used when multiple documents are involved. However, working 
with more than one document at a time presents new obstacles to summarization such as 
redundancy removal and conflict resolution. The general problems of summarization like 
the optimal summary length and what information to include are still present. This 
research is not interested in optimal size (for reasons that will become clear), but conflict 


and redundancy are shallowly addressed for the PakNews domain. 


Automatic Biographical Summarization, or ABS, is a direct application of MDS, 
though it may be a slightly easier problem to solve. The problem may be easier because 
in a robust MDS system, the expected output would be a natural language summary of 
several source documents of indeterminate length. In an ABS system, however, the 
expected output is essentially a list of non-prose facts about an entity derived from the 
source texts. Not only this, in an MDS system the entirety of the document is considered, 
but in an ABS system the documents are examined on a per-sentence basis. The 


differences between ABS and MDS will be further explored in Chapter II. 
B. STATEMENT OF PROBLEM 


Documents containing biographical information are proliferating as 
the number of information sources continues to grow. Intelligence analysts must sift 
through both publicly available and classified resources in order to create profiles of 
individuals such as foreign heads of state, terrorists, and watch-listed foreign citizens. 
Assembling this information is a time-consuming and detailed process. No publicly 
known system exists to aid analysts with gathering and condensing the information 


available. This research attempts to create such a system with the belief that it is faster to 
assess the accuracy of a computer-generated dossier than it is to generate a full report 


from scratch. 
C. ASSUMPTIONS 


Science advances by building upon previous research that has been done in an 
area. This research is no exception and a necessary prerequisite for the research 
presented is the tagged output from the Fair-Isaac Entity Disambiguation System. The 
output and the procedure used to generate it are assumed to be ground-truth. So, in each 
phase of research, if Fair-Isaac names a particular entity to be an organization, then we 
will blindly assume this to be the case and develop algorithms for correction if necessary. 
Also, if Fair-Isaac states that two entities are the same, then they are deemed the same. 


Details about the Fair-Isaac system can be found in Chapter III. 
D. METHODOLOGY 


The overall system architecture is summarized in Figure 1 below and provides an 
overview of how the system interacts with the Fair Isaac system and how the output is 
produced. The sentences containing the provided references are compiled into a sort of 
“mini-corpus” for that entity. Further, lists of titles, locations, and organizations were 


collected from Fair Isaac to prime the information extraction process used later. 


1. Sentence Filtration 


A secondary goal for this project was to determine the amount of intelligence 
necessary to extract relevant information from the text. The relevant information which 
is of primary concern can be categorized as “Name,” “Job Title,” “Organizational 
Affiliations,” “Lifespan Data,” “Quotations,” “Family,” “Associations,” and “Locality 
Information.” To collect this information, three different methods were developed for 
extracting information from the corpus. Each method looked at the problem from a 
slightly different perspective and each required a greater amount of complexity. The first 


method used to extract information was to simply develop several selectors (keywords) 




which filtered sentences into the categories seen above. This method was the simplest, 


though the results derived from it were better than expected. 
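
The selector method described above can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the selector lists are hypothetical (the text names only examples such as “born” and “hometown”), and the function names are assumptions made for the sketch.

```python
import re

# Hypothetical selector lists; the thesis does not enumerate its actual
# keywords beyond examples such as "born" and "hometown".
SELECTORS = {
    "Lifespan Data": ["born", "died", "hometown"],
    "Family": ["wife", "husband", "son", "daughter", "brother", "sister"],
    "Quotations": ["said", "stated"],
}

def categorize(sentence, selectors=SELECTORS):
    """Return every category whose selector list matches a whole word."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return [cat for cat, keys in selectors.items() if words & set(keys)]

def filter_sentences(sentences, selectors=SELECTORS):
    """Group sentences by category; one sentence may land in several."""
    buckets = {cat: [] for cat in selectors}
    for sentence in sentences:
        for cat in categorize(sentence, selectors):
            buckets[cat].append(sentence)
    return buckets
```

Matching on whole words keeps a selector such as “son” from firing on “person,” but the filter still has no notion of who was born or where, which is the limitation the later methods try to address.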
[Figure 1 diagram. Recoverable labels: Redundancy Removal, Conflict Resolution, Biographical Templates, Merge Data, Biography.] 
Figure 1. Automatic Biography Generation System. 


The second method slightly expanded on the first by adding some 
primitive contextual awareness to the filtering process. For example, when attempting to 
extract job titles for an entity, it makes sense to look immediately before the entity’s 
name rather than simply collecting all titles that occur in a sentence. This same kind of 


logic can be used to determine things like associations and family relations. 
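
The positional heuristic for job titles might look like the sketch below. It assumes a list of known titles is available (such as the title list collected from Fair Isaac); the function name and the longest-match rule are assumptions introduced for illustration.

```python
def title_before_name(sentence, name, known_titles):
    """Return the longest known title that immediately precedes `name`, or None."""
    idx = sentence.find(name)
    if idx < 0:
        return None
    # Text to the left of the name, with trailing whitespace dropped.
    prefix = sentence[:idx].rstrip().lower()
    matches = []
    for title in known_titles:
        t = title.lower()
        if prefix.endswith(t):
            start = len(prefix) - len(t)
            # Require a word boundary so a title is not matched mid-word.
            if start == 0 or not prefix[start - 1].isalnum():
                matches.append(title)
    # Prefer "Foreign Minister" over the shorter "Minister".
    return max(matches, key=len) if matches else None
```

For the sentence “President Jane Doe met Foreign Minister John Roe.” (invented names), the function attaches “President” to Jane Doe and “Foreign Minister” to John Roe, whereas a plain keyword filter would collect both titles for both entities.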


The third method starts by examining the content of the original marked-up news 
articles. Then, each time the <PERSON> tag is encountered, a new Person object is 
created with other details from the sentence or article. These accompanying details (if 
present) include the timestamp of the article, surrounding entity names, job title of the 
entity, family information, and lifespan information. The system then merges similar 
Persons by examining the similarity of the details. 
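
A rough sketch of the object-creation step follows. The inline tag format (`<PERSON>...</PERSON>` and similar tags for organizations and locations) is an assumption about the marked-up articles, and the `Person` fields shown are a reduced, hypothetical subset of the details listed above.

```python
import re
from dataclasses import dataclass, field

# Assumed markup: entity text wrapped in matching open/close tags.
TAG = re.compile(r"<(PERSON|ORGANIZATION|LOCATION)>(.*?)</\1>")

@dataclass
class Person:
    name: str
    timestamp: str                               # timestamp of the source article
    neighbors: set = field(default_factory=set)  # other entities in the sentence

def persons_in_sentence(sentence, timestamp):
    """Create one Person per <PERSON> tag, recording co-occurring entity names."""
    entities = TAG.findall(sentence)  # list of (kind, text) pairs
    people = []
    for i, (kind, text) in enumerate(entities):
        if kind == "PERSON":
            others = {t for j, (_, t) in enumerate(entities) if j != i}
            people.append(Person(text, timestamp, others))
    return people
```

Each tagged mention yields its own object, so the same individual is represented many times until the merge step compares the accompanying details.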




2. Redundancy Removal 


A characteristic of news articles is that they often repeat information found in 
other articles. A key aspect of all three methods is the ability to remove this redundancy 
from the final information presented in the dossier. The system handles redundancy by 


performing several recursive comparison and merge operations. 
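
One way to picture a recursive compare-and-merge is sketched below. The record layout (a set of names and a set of facts) and the merge criterion (two records share at least one name) are assumptions chosen for illustration; the criteria actually used by the system may differ.

```python
def remove_redundancy(records):
    """Recursively merge records that share a name until no pair overlaps.

    Each record is a dict with a "names" set and a "facts" set. Set union
    drops facts that were repeated across articles.
    """
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if records[i]["names"] & records[j]["names"]:
                merged = {
                    "names": records[i]["names"] | records[j]["names"],
                    "facts": records[i]["facts"] | records[j]["facts"],
                }
                rest = [r for k, r in enumerate(records) if k not in (i, j)]
                # A merge may expose new overlaps, so compare again.
                return remove_redundancy(rest + [merged])
    return records
```

Re-running the comparison after every merge handles chains: if record A overlaps B and B overlaps C, all three end up in one record even though A and C share no name directly.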
3. Merge Data 


A necessary support of this research was the creation of “template” biographies 
for a type of person. These templates were created because often a corpus of news 
articles will not contain every detail that can be known about an individual. However, if 
some characteristics are universally true for a particular type of person, then the known 
data can be merged with the template data to create a fuller biography. For example, the 
U.S. President must be at least 35 years old to hold the office, yet that information may 
never be presented in the source articles. Creating a “U.S. President” template with the 
age filled in as above allows for introducing real-world knowledge into the process 
without corrupting the system’s decisions. Performing this step can fill in gaps in 


knowledge and provide an age of After 1972 for that slot of the biography report. 
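
The template merge can be sketched as a simple slot fill. The slot names and the rule that extracted values always take precedence over template values are assumptions made for this illustration.

```python
def merge_with_template(extracted, template):
    """Fill empty slots from the template; extracted values always win."""
    biography = dict(template)
    # Only slots that were actually filled from the corpus override the template.
    biography.update({k: v for k, v in extracted.items() if v is not None})
    return biography

# Hypothetical template carrying one piece of real-world knowledge.
US_PRESIDENT_TEMPLATE = {"Job Title": "U.S. President", "Minimum Age": 35}
```

Because corpus-derived values override the template, the template can only fill gaps; it never replaces a fact that the system found in the source articles.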
4. Evaluation 


To evaluate the system, the primary concern was the correctness of the 
information that was gathered and compiled into the reports. Stated differently, the goal 
of evaluation was to ensure that the facts presented in the different slots of the biography 
were true facts about the entity in question. A more subjective analysis was also 
performed to determine the amount of time it took for each report to be compiled by the 
system as opposed to how long each report would have taken to be compiled by hand. A 
more traditional strictly statistical analysis was not undertaken for reasons explained in 


Chapter IV. 


E. ORGANIZATION OF THESIS 


In this thesis, the background of this work is discussed in Chapter II; the research 
performed is described in Chapter III; Chapter IV explains the results 
obtained and offers a brief analysis; finally, ideas for further work in this area are 
discussed in Chapter V. 


II. BACKGROUND AND RELATED WORK 


A. DEFINITION OF PROBLEM SPACE 


A sub-area of Artificial Intelligence (AI) research is called Natural Language 
Processing (NLP). NLP seeks to find ways for computers to read and write documents in 
as human a way as possible. Recent departures into the field have included work on 
applications for searching, indexing, and sorting content of various types. Within NLP, 
there is also interest dating back to the 1950s in automatically generating summaries of 
text similar to ones produced by human abstractors. Though interest in summarization 
waned slightly during the 1970s and 1980s, the advent of the Information Age has 


created a new flurry of research questions to be solved. 


Current media is so summary-laden that the concept seems to be ubiquitous and 
summary generation is perceived as a trivial research question. Part of the reason for this 
perception is the ease with which humans both create and decode summaries to glean 
information. Examples of summaries are all around: television guide synopses, sports 
statistics, news headlines, stock tickers, advertisements, etc. These summaries are 
necessitated by the ever-increasing amount of information available. Information volume 
is also multiplying within the realm of the Department of Defense. A common type of 
summary that could be of particular interest to the DoD during the current War on Terror 


is the biography. 


A biography is not a summary of text or a set of documents, but of a person’s life. 
Independent of a person’s lifespan, the biography is concerned with the events of that 
lifetime, interactions with others, and affiliations (such as group memberships, racial 
profile, etc). Knowing these details about an individual can often give a picture of the 
person’s life. For instance, it would be reasonable to make conjectures about when a 
person lived based upon their contemporaries, their job title, and their parent 
organization. When used in the context of the intelligence community, documents 


containing facts like the ones above about a person of interest are called dossiers. 


Currently, these dossiers, along with larger intelligence briefs, must be compiled 
manually, often after the analyst has gone through the long task of gathering intelligence 
about an individual from various information sources. No known public system exists 
that can sift through these sources and compile a dossier about a person automatically. 
Since these sources are most often intelligence reports in natural language, however, the 


problem becomes one which text summarization can help solve. 


When determining how to apply summarization, three things must be considered: 
input, output, and approach (Barzilay, 03). Even though the focus of summarization 
work is on text summarization, input can be anything that can be represented as text. 
This includes audio transcripts, video descriptions, and compiled text from multiple 
documents. Once the system processes the text, the output can either be a static listing or 
part of a larger query-based database. The amount of text displayed in the output is also 


variable dependent upon the needs of the application and the content area available. 


While input and output are relatively straightforward, the approach (or 
methodology) to perform the summary is the focus of the bulk of summarization work. 
There are several ways to categorize summarization methodologies: extract vs. abstract; 
indicative vs. informative vs. evaluative; generic vs. query based; or single document vs. 
multi-document. Further, different categories can overlap and add an additional layer of 


meaning to the method selected. 


The terms “extract” and “abstract” deal with the origin of the content used in the 
summary. If the summary is an extract of a target text, then each word from the summary 
appears in the source document and has simply been extracted into a compressed form. 
Unlike an extract, an abstract may be a paraphrase or a completely unique retelling of 


what the source text relates. Consider this example about the 9/11 Commission Report: 





Extract 


We present the narrative of this report with a unity of purpose. September 
11, 2001 was a day of unprecedented shock and suffering in the history of 
the United States. The nation was unprepared. How did this happen, and 


how can we avoid such tragedy again? 
Abstract 


The 9/11 Commission Report is mainly concerned with the events leading 
up to September 11, 2001. Various facets of the issue are examined from 
the teachings of the religion of Islam to the lack of vision by U.S. leaders 
to the failures of equipment used at Ground Zero. The Commission 
presents their analysis of the weak points in U.S. intelligence and 


readiness and proposes solutions. 











Figure 2. Example Extract and Abstract. 


Summaries can also be categorized according to the content they provide for the 
reader. There are generally three types of summaries: indicative, informative, and 
evaluative. Indicative summaries give a clue to the examiner what genre of information 
the source document contains. Since their job is to simply indicate type of content, 
indicative summaries can be quite short. Examples of indicative summaries would 
include the film content information that now accompanies MPAA ratings or the large 
chapter headings within a document like this thesis. Informative summaries are meant to 
inform the reader about some fact or related facts contained in the source text. 
Summaries which are informative are perhaps the most common type of summary and 
encompass such things as news headlines. Finally, evaluative summaries are meant to be 
a critical review of the text. Common examples of these summaries are book reviews 


written by consumers on commerce sites such as Amazon.com. 


How users interact with a summary is another way to classify the type 
of summary a system creates. If the summary is a generic one, then it could be intended 
for a number of applications. Generic summaries could be used to reduce the amount of 
text necessary for a search engine to index. Also, they could be used to more easily index 
and classify those texts. Generic summaries are also useful for machine translation, since 
it is less expensive to translate a smaller block of text than the original document. Query- 
based summaries are created for the immediate purpose of answering a user query about 


some information (presumably in a database). The important distinction here is that 


query-based summaries do not usually exist before user input is received. Thus, they 
have the potential to be more dynamic, though fluidity is not required for a query-based 


system. 


The final classification that can be made between summaries is based upon the 
source text. The source text can either be a single document or multiple documents. 
Single document summarization presents a number of research challenges, such as 
analyzing discourse structure, selecting salient information, and optimal source 
compression. Multi-document summarization presents the same challenges, but also adds 
the need for redundant information checks, cross-document co-referencing of entities, 


and inter-document conflict resolution. 


Once the type of output desired is selected, the input is known, and an approach is 
settled upon, actual work on the system can begin. The next section will give a detailed 
overview of what the work of summarization looks like. Then, the third and final section 
will present a timeline of the work that has been done in the field to date and some 


commentary on the successes and failures of previous research. 
B. SUMMARIZATION OVERVIEW 


As stated similarly above, there are three general pieces that every automatic 
summarization system is composed of: the input, the approach, and one or more output 
summaries. Again, the input and output are fairly straightforward, but the approach used 
can vary widely depending upon the intended final use of the system. Almost without 
exception, the approach employs some type of compression algorithm. In the context of 
summarization, compression is the term used to describe the process of extracting the 
most relevant content from the source text. Essentially, the source text is being 


compressed into a summary void of any information deemed non-essential. 


While many compression algorithms exist, the decision to employ one over 
another is affected by three components: the audience, the function, and the fluency. 
When performing summarization, the audience can be known (allowing more focused 
summaries) or unknown. Similarly, the function of the summary can be indicative, 


informative, or evaluative. The final component of compression is the desired level of 


fluency. In some circumstances, one may wish to generate a list of bullet points about a 
summarized article while other times a more natural language output is desired. After 
considering these factors, "the compression algorithm will produce one or more output 
summaries that will be a user-defined percentage of the original source material" (Mani et 


al., 01). 


The summarization process can also be viewed as three different phases: analysis, 
transformation, and synthesis (Sparck Jones, 97). Analysis can be performed at either 
the surface level, the entity level, or the discourse level. Intuitively, the surface level of a 
document is defined as its actual components: the sentences, images, and 
headlines are all major components of the document's surface level. Examining a 
document at the entity level employs the use of a named-entity recognition (NER) system 
which can identify persons, places, and organizations. Analyzing a document at the 
discourse level is the deepest level above a pure semantic parsing of the text. In short, 
the discourse structure of a document is the flow of meaning in the text which considers 
things such as anaphora, content from previous sentences, and temporal information and 


tries to determine how the elements of a document are related. 


The goal of the transformation phase is to take any information extracted during 
the analysis phase and apply algorithms that will fix any word ordering or incorrect 
grammar situations created. It is important to note that the deeper one analyzes the text, 
the more complicated the transformation algorithm may become. This growth in 
complexity is caused by removing smaller pieces from a document that must be plugged 
into a larger framework rather than extracting larger pieces, like sentences, which can 
usually be pieced together more easily. A secondary goal of transformation is to remove 
seemingly unnecessary details from components, such as descriptive phrases, etc., 


depending on the level of compression desired. 


The final stage of the summarization process, the synthesis of the output, can be as 
straightforward as concatenating everything provided by the transformation phase. This 
is rarely the case, however, because in most summarization situations a close to natural 


language output is desired. In order to achieve the resemblance to natural language, the 
synthesis phase is truly a language generation module that performs content and lexical 
selection, aggregates phrases into sentences, and creates its own discourse structure 


(Jurafsky et al., 01). 


Each piece of language generation is an area of research all of its own. Content 
selection deals with examining the content provided by the transformation stage and not 
only choosing the information that is most relevant from the source text, but also 
determining what would most aid the summary’s coherence. Lexical selection is 
beneficial to compression because it may be profitable to replace phrases from the source 
text with single words. For example, the source text may contain a vernacular phrase 
such as “up the creek” which the generation module could excise and replace with some 
synonym referring to misfortune. Once content has been selected and all ideas have been 
expressed in their best form, the text synthesizer must construct sentences from the 
smaller pieces. This can sometimes be accomplished through the use of heuristics and a 
sentence template. Once a group of sentences has been formed, they must be chained 
together to form the summary. Doing this can be very complicated, but can be aided by 
the use of an underlying discourse structure, which keeps track of the concepts involved 


in the sentences and uses heuristics about them to cohesively relate them. 
C. PRIOR WORK 


The foundational application of text summarization was the automatic creation of 
abstracts for research papers (Luhn, 58). The approach attempted to follow the intuition 
that the main subject of an article would be a word that appears frequently throughout the 
article (determiners and prepositions were ignored). To decide which sentences to 
include in the abstract, the system performed analysis on individual sentences and 
measured the significance of each one. Significance of a sentence was determined by 
looking for the presence of significant (i.e., often repeated) words from the overall article. 
From the smaller set of sentences chosen as significant, a probabilistic algorithm was 


developed to rank the sentences in order of significance. Sentences that scored above an 




arbitrary threshold became part of the abstract. Luhn’s work was very influential in 
directing future researchers to look for statistical techniques when working with 


summarization. 
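
The frequency intuition can be sketched in a few lines. This is a simplified illustration rather than Luhn's published measure: the stop list, the `min_count` cutoff, the scoring rule (a plain count of significant words per sentence), and the threshold value are all assumptions made for the sketch.

```python
import re
from collections import Counter

# Small illustrative stop list standing in for the ignored function words.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and",
             "is", "are", "for", "with", "by", "at"}

def tokenize(text):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def luhn_style_extract(sentences, min_count=2, threshold=2):
    """Keep sentences containing at least `threshold` frequently repeated words."""
    freq = Counter(w for s in sentences for w in tokenize(s))
    significant = {w for w, count in freq.items() if count >= min_count}
    scored = [(sum(w in significant for w in tokenize(s)), s) for s in sentences]
    # Sentences scoring at or above the arbitrary threshold form the extract.
    return [s for score, s in scored if score >= threshold]
```

Both cutoffs are arbitrary, which mirrors the arbitrary threshold noted above: changing either one changes which sentences survive.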


The next major work on summarization was focused on creating indicative 
abstracts of scientific articles while also attempting to create an adaptive research 
methodology (Edmundson, 69). Like Luhn, Edmundson attempted to 
computationally model human abstractors by looking for “significant” sentences. To do 
this, he expanded the definition of a significant sentence from one that contained high- 
frequency keywords to sentences that contained cue words or heading words. He defined 
cue words as words belonging to one of three sub-areas: bonus words, stigma words, and 
null words. Bonus words like “significant” were clues to ideas which were probably 
essential to the paper’s theme. Stigma words like “hardly” were defined as words which 
clued an opposing position to the paper’s theme. Finally, null words did not affect the 
theme one way or another. The group of null words was composed of ordinals, the verb 


“to be,” prepositions, coordinating conjunctions and other less significant parts of speech. 


Edmundson also factored in sentence location within a document. For example, a 
section in his source material may have begun with “In this paper...” This sentence and 
others similar to it were granted special weights to denote that they probably contained 
thematic information. Sentences that concluded sections were also given special 
attention. According to his experimentation, the method which considered locality 
received higher marks for proper co-selection of information than the other methods 


tried. 
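
A combined cue-word and location score in the spirit of this approach might be sketched as follows. The word lists contain only the two examples given above, and the weights (plus one, minus one, and a location bonus of one for the first and last sentence of a section) are hypothetical; heading words are omitted for brevity.

```python
import re

BONUS = {"significant"}   # example from the text; real lists would be longer
STIGMA = {"hardly"}       # example from the text

def cue_score(sentence):
    """+1 per bonus word, -1 per stigma word; null words contribute nothing."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return sum((w in BONUS) - (w in STIGMA) for w in words)

def location_score(index, section_length, weight=1):
    """Reward sentences that open or close a section."""
    return weight if index in (0, section_length - 1) else 0

def score_section(sentences, location_weight=1):
    """Score each sentence of one section by cue words plus location."""
    n = len(sentences)
    return [cue_score(s) + location_score(i, n, location_weight)
            for i, s in enumerate(sentences)]
```

Setting `location_weight` to zero reduces the scorer to cue words alone, which makes it easy to compare the two variants on the same section.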


Edmundson’s results overall, however, were not promising. The poor 
performance of his system after seventeen iterations of experimentation led him to 
conclude that “it is now beyond question that future automatic abstracting methods must 
take into account syntactic and semantic characteristics...they cannot rely simply upon 
gross statistical evidence.” This result was a hard blow to summarization research in 
general, because Luhn’s work twelve years before had led most researchers to believe 


that the problem would be quickly and easily solved by more modern algorithms. 
After Edmundson, steady work in the area of automatically creating abstracts 
continued, though interest waned in the second half of the 1970s. The focus continued to 
be on generating natural language abstracts for scientific (particularly chemistry) articles. 
This led to stagnation in the field because the vision was completely limited to this small 
domain which did not provide much material. However, an abundance of material would 
not have been beneficial at this time, because the methods available to gather data to train 
and test systems were very costly and time-consuming. Then, once the data was 
assembled, it had to be analyzed by hand or reviewed by professional abstractors — 


another time-consuming and costly job. 


Summarization research was practically non-existent during the 1970s and 1980s. 
Instead of attempting to develop actual systems, researchers turned to the more 
theoretical aspects of the discipline with hopes of making a revolutionary breakthrough. 
Paice (Paice, 89) posited that there were only seven major approaches for determining sentence 
significance. They were: frequency-keyword, title-keyword, location, syntactic criteria, 
cue words, indicator-phrases, and relational criteria. Paice concluded that a frequency- 
keyword approach is trivial and un-informative. However, he did not completely 
discredit work based on keyword frequency but declared other word matching 


approaches such as cue words and indicator phrases as more likely to yield better results. 


Out of the rest of the methods, Paice determined that a syntactic criterion (i.e., the 
distribution of the word throughout the document) was unviable (Earl, 70). This was 
mainly because the work done by Earl focused on modeling sentences as phrase structure 
representations and after looking at 3,000 sentences, 99% were unique structures. The 
progress in the area of Earl’s work allows us to now determine that she “overfit” her 
experiment to her data. Several of the other “new” approaches, such as the frequency- 
keyword, title-keyword, location, and cue words, were simply expressions of Luhn and 


Edmundson’s previous work compiled in a new form. 


Interest in document summarization began to grow again in the 1990s with the 
advent of the Internet, which affected the field of summarization in two profound ways. 
First, the Internet provided the opportunity to gather and process much larger collections 


of data than was previously possible. The foundational research had worked with a 
minuscule amount of data (Edmundson worked with only 200 documents) in comparison 
to what could now be catalogued online. The Internet also provided a fresh need and 
desire for summarization. As the amount of information on the Internet exploded, new 


ways were sought to visualize, structure, and organize data. 


Kupiec, Pedersen, and Chen (Kupiec et al., 95) developed one of the first modern 
document summarizers. Their stated goal was to “develop a classification function that 
estimates the probability a given sentence is included in an extract given a training set of 
documents with hand-selected document extracts.” Their approach was similar to 
previous work in that it required a pre-compiled corpus of scientific documents and their 
extracts. Instead of following the past approach of weighting each word and then 
weighting each individual sentence, however, they used probability and statistics. They 
derived a Bayesian classification function to assign a score to each sentence based upon 
distinct features. The score can then be used to determine which sentences to include in 
a summary. By using a classification function and training on a corpus of documents, 
they were able to allow the corpus to set the weight of each feature instead of arbitrarily 
setting the weights in the beginning. The scheme to score features was derived from 


(Paice, 89). The main evaluation of the system yielded 83% correctness.¹ 
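
A Bayesian sentence scorer of this general kind can be sketched as below. Under the usual independence assumption, the score for a sentence s with features F1..Fk is P(s in S) multiplied by the product of P(Fj | s in S) / P(Fj). The binary features, the Laplace smoothing, and the function names are assumptions added for this sketch and are not taken from the original system.

```python
def train(examples, k):
    """Estimate the prior and per-feature probabilities from labeled sentences.

    examples: list of (features, in_extract) pairs, where features is a
    tuple of k booleans and in_extract marks hand-selected extract sentences.
    """
    n = len(examples)
    n_pos = sum(1 for _, y in examples if y)
    prior = n_pos / n
    # Laplace smoothing (an addition for this sketch) avoids zero probabilities.
    p_f_given_s = [(sum(1 for f, y in examples if y and f[j]) + 1) / (n_pos + 2)
                   for j in range(k)]
    p_f = [(sum(1 for f, _ in examples if f[j]) + 1) / (n + 2) for j in range(k)]
    return prior, p_f_given_s, p_f

def score(features, model):
    """Naive Bayes score: prior times the product of likelihood ratios."""
    prior, p_f_given_s, p_f = model
    result = prior
    for j, present in enumerate(features):
        num = p_f_given_s[j] if present else 1 - p_f_given_s[j]
        den = p_f[j] if present else 1 - p_f[j]
        result *= num / den
    return result
```

Because the independence assumption is only an approximation, the score is best read as a ranking value rather than a calibrated probability; the highest-scoring sentences are the ones placed in the summary.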


Parallel to the research into creating extracts from text, other research was also 
concerned with producing abstracts of text. An abstract in the sense of summarization is 
a short description about an article that contains little or no material from the original 
article. Another approach was to use a standard template for what a user would prefer an 
abstract to look like and then to populate the template with material extracted from the 


text (McKeown et al., 95). Their work became known as the SUMMONS system. 


While news of SUMMONS spread, Myaeng and Jang (Myaeng et al., 96) 
followed Kupiec’s approach and created their own probabilistic system. Their approach 
was slightly modified, however, because they started by manually identifying not only 


features but components of the text. The sentences were then scored on not just the 


¹ For their experimentation, correctness was defined by “the fraction of manual summary sentences 
that were faithfully reproduced by the summarizer program.” 
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presence or absence of features, but also which textual component it belonged to. Once 


again, the highest ranked sentences became a part of the final summary. 


The next major research in 1999 produced a system known as DimSum (Aone et 
al., 99). The work was particularly ambitious as evidenced by the introduction of the 
paper where each approach for generating summaries is chronicled and the discussion is
concluded with this statement: “Our work addresses challenges encountered in these
previous approaches...” (Aone et al., 99). The team did address several issues. First, the
very foundation of the statistical approach was altered by using text statistics and corpus
statistics to determine that a word such as “bill” in “reform bill” should be counted
independently of “bill” in “Bill Clinton.” Second, instead of using Kupiec’s scheme of 
finding entities based upon capital letters, they were able to use SRA’s NameTag™ (with 
accuracy in the mid-90%) to tag their corpus with names, places, etc. Third, the team 
took advantage of WordNet (Miller et al., 90) to find synonyms to words in the text. This 
allowed them to bolster the counts of some words by including synonym counts. Even 
when using some of these advanced tools, however, the final summary was still created 
by selecting sentences with high scores. Also, scores were still generated by the counts 


of individual words which make up the sentence. 


Still, most of the focus remained on creating generic summaries until others began 
to research producing biographical summaries (Schiffman et al., 01). Acknowledging 
that book-length biographies are beyond the capability of computers, the researchers 
focused on creating a short paragraph containing biographical information from a corpus 
of 1,300 news documents about the Clinton-Lewinsky affair (termed the Clinton corpus). 
The approach did not pre-suppose the presence of any particular information in the 
corpus, but it instead allowed the corpus to dictate what was stated about any particular 
entity. Also, the research did not take temporal cues into account, which has only 
recently been addressed (Bethard et al., 07). To generate output, canned text was used to 
fill in the gaps between extracted texts. While the results and methodology of the paper
are generally unimpressive, its most important contribution to summarization research
was the creativity to envision a biographical summarization system.

A second approach to automatically generate biographies used entity recognition
coupled with an ontology (Alani et al., 02), or “a set of distinct objects resulting from an
analysis of a domain” (Martin et al., 01). In this case, the researchers sought to
automatically generate biographical summaries of famous painters from information
found on various reputable Internet sites. The system marks the first attempt to directly
interface a biography generation system with the Internet. It was also the first system that
sought to produce dynamic summaries for a query-based system in which the user could
choose what “view” of the biography they wished to see (where each “view” focused on
a different aspect of the painter’s life).


While their final output appeared impressive, there are a few subtle points about 
their research that lessen the luster. First, the researchers chose to follow Schiffman’s 
example and use templates to render the final biography. Second, they made their system 
heavily domain dependent by developing an ontology to fit the data that their system 
would be processing. Third, the actual procedure used to extract information from other 


web sites, which the system uses as a dynamic corpus, is left to the reader’s imagination. 


These early attempts to automatically generate biographies are examples of 
systems that dealt with the elements of the source text at an entity level. In information 
retrieval theory, entities “are things of interest; one might say objects of interest...the 
objects that a system is designed to store and retrieve” (Smiraglia, 02). Thus, in light of 
the desire to extract information about persons from news articles, entities are primarily 


formal names of individuals, locations, and organizations. 


Recognizing entities in open text is now done fairly accurately.2 Yet determining 
which entity names in the text are part of the set of names used to refer to a particular 
person in real life is a separate issue known as automatic entity disambiguation, or cross 
document co-referencing. One system that performs automatic entity disambiguation 
was developed at Fair Isaac (Blume, 05). In a corpus of Pakistan News Agency articles, 
the Fair Isaac system was able to achieve greater than 95% accuracy when determining 


named entities and the algorithm used to merge two entities achieved over 99% accuracy. 





2 Evidence of this can be found in a recent presentation from Microsoft Research:
http://www.mathcs.emory.edu/~eugene/talks/cikm2005.ppt.

Merging entities involved resolving many-to-many relationships to determine, for
example, the difference between references to a cricket player named Yasir Arafat and the
deceased leader of the PLO by the same name.


Zhou, Ticrea, and Hovy (Zhou, 05) attempted to address biographical 
summarization in light of the fact that information is being extracted from multiple 
documents. They followed the methodology of Aone, choosing to use information 
retrieval and classification techniques to extract information from their source corpus. To 
make the final output of their system useful, they also chose to store the biographies and 


create an interface through which users could query the data. 


To train their information retrieval algorithms, they annotated a corpus of 130 
biographies about 12 different individuals. Through doing so, they discovered several 
common components of each biography: lifespan data, popularity, personality, personal, 
social, education, nationality, scandal, and work. Each sentence in the corpus was
classified with one of the above labels and also labeled as to whether it belonged in
the final biography or not. The sentence’s presence in the final output was determined by
its score, which was again determined by textual and corpus statistics.


The most recent research into biographical summaries focused not on creating a 
full narrative about a person’s life, but on answering biographical questions about a 
person (Feng et al., 06). The output they expected from their system would accurately 
answer questions such as “When was Albert Einstein born?” based upon information 
extracted from web pages. To aid efficiency, they proposed to use data mining to gather 


information ahead of user queries and cache the answer for later use. 


The current state of the art in biographical summary is not clearly defined. It 
depends entirely upon the focus and domain of the research. Natural language 
biographies have been generated, but only using templates and canned text. Query-based
systems have been developed, but the answers returned are generally tidbits of a 
biography and not the entire generic biography. Further, evaluation of summarization 
systems in general, including biographical summarization systems, remains an open 


research question.

III. RESEARCH DETAILS


A. GENERAL DESCRIPTION 


Automatic summarization continues to evolve as a complex and multi-faceted 
problem that is deservedly receiving much academic attention. Yet, as demonstrated in
the previous chapter, very few researchers are approaching the problem of automatic 
biography summarization (ABS). Perhaps more troublesome, the current line of 
biography summarization is perceived as a completely parallel line of research to 
automatic document summarization. Thus, researchers automatically apply the latest 
document summarization research to ABS work to ensure the construction of robust 
systems. These “robust” systems, however, are generally limited to one domain and one 
function and have been given far more complicated intelligence than what is actually 


needed to accomplish the task. 


While many techniques used in document summarization should undoubtedly be 
applied directly to ABS work, the fundamental question about how powerful a system 
would need to be to perform ABS has been left unanswered. Are full discourse analyses, 
named entity recognition, and robust semantic classifiers necessary or can the job be 


performed using mostly keyword selectors and some knowledge of the target documents? 


Since that basic question had not been answered in the literature surveyed, the 
hypothesis which guided experimentation was that keyword filters and word location 
would be sufficient to achieve a system with a very high degree of information integrity. 
In order to test this hypothesis, four versions of the system were developed to perform the 
keyword filtering in different ways. The different versions represent the “approach” 


piece of the summarization process, composed of the input, approach, and output.

B. INPUT DETAILS

Each of the four versions of the system used the following input sources:

1. PakNews Corpus


Put simply, a corpus is a collection of texts (Saggion, 04). When Fair Isaac 
sought to build their automatic disambiguation system (Blume, 05), they needed to 
collect a group of documents that would be well suited for their purpose. They decided 
to use news articles, which is logical given that news articles are generally full of 
references to different entities. In particular, they collected articles over a 44-month
period from the Pakistan News Service. This particular service was chosen primarily
because the PakNews Service is written mostly by amateurs who may have differing
levels of expertise transliterating Arabic names to English. This allows for a single name
to be spelled (and misspelled) multiple different ways, an ideal challenge for a
disambiguation system. The articles are also conveniently accessed via the Internet at
http://www.paknews.com and each article has an accompanying timestamp.3
2. Fair Isaac 


The system provided by Fair Isaac is primarily written in Perl with a user 
interface coded in TCL/Tk. The overall goal of the system is to disambiguate references 
to people with the same name. Within the field of natural language processing, this is 
referred to as cross-document co-referencing. The system creates feature-based vectors 
and compares them to each other to determine whether one entity is the same as another. 
For instance, in the PakNews corpus, there are two entities referred to as Yasser Arafat. 
One Arafat was the head of the Palestine Liberation Organization while the other is a
cricket player. Originally developed for use in determining a person’s credit score, the 
system is able to disambiguate the references with very high overall accuracy (greater 


than 95%). 


3 At the time of this writing, this site is no longer active.

While the full explanation of how the disambiguation is performed can be found
in (Blume, 05), it may be profitable to describe the process in general. The original
PakNews corpus has been tagged with XML to identify document elements such as
headlines, document boundaries, document timestamps, and mentions of named
entities. Each entity mention (person, organization, and location) is given an
identification number and placed into an XML master list of corpus entities. This list is
processed by the system, which produces a new XML list of disambiguated entities,
complete with a list of all identification numbers that refer to the same entity. These
identification numbers are pointers back into the original articles where the entity of
interest is mentioned. An example of the output of this processing step can be seen in
Figure 3.





<PERSON ID="p_osama_laden_001_p" LNG="Osama bin Laden">741 
1568 1907 2268 3720 4294 4427 4643 4706 4750 4871 5508 6094 
6336 6920 6931 7041 7245 7417 8654 8973 9010 9273 9430 9446 
9584 9632 9687 10209 10238 10303 10307 10903 11338 11418 
11446 13800 14582 15032 15385 15484 15550 15666 15709 16018 
17922 18143 18473 18524 21125 21225 21376 21378 21381 21382 
21422 21434 21518 21538 21540 21554 21557. 

















Figure 3. Output from entity disambiguation step. 


These mentions are traced to their original positions within the documents and the
enclosing sentence is extracted as a unit. Each entity mention appears in the original
news articles in the following way: “<PERSON ID="741"
STD="p_osama_laden_p">Laden</PERSON> allies suspected behind Buddha
destruction.” Conveniently, the original news document data files place one sentence
per line, each terminated with a newline character. Thus, there is no need for complex
sentence tokenization beyond splitting the string on newline characters.
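
The handling of a tagged line can be illustrated with a short Java sketch. It assumes the tag layout shown in the example above (an ID attribute followed by an STD attribute); the class and method names are hypothetical and are not taken from the original system.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative, not from the original system.
final class MentionParser {
    // Matches tags of the form <PERSON ID="741" STD="p_osama_laden_p">Laden</PERSON>.
    private static final Pattern PERSON = Pattern.compile(
            "<PERSON ID=\"(\\d+)\" STD=\"([^\"]*)\">(.*?)</PERSON>");

    // The data files hold one sentence per line, so splitting on newlines suffices.
    static String[] sentences(String document) {
        return document.split("\\r?\\n");
    }

    // Returns the mention ID of every PERSON tag found in one sentence.
    static List<Integer> mentionIds(String sentence) {
        List<Integer> ids = new ArrayList<>();
        Matcher m = PERSON.matcher(sentence);
        while (m.find()) {
            ids.add(Integer.parseInt(m.group(1)));
        }
        return ids;
    }

    // Strips the markup so that only the readable sentence remains.
    static String plainText(String sentence) {
        return PERSON.matcher(sentence).replaceAll("$3");
    }
}
```

Under these assumptions, the example sentence yields the single mention ID 741, and stripping the markup leaves “Laden allies suspected behind Buddha destruction.”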


As stated above, the PakNews corpus also contains occurrences of organizations 
and locations tagged in a similar manner to persons. Using this tagged corpus along with 


the entity mention IDs, the preliminary conclusion was that finding information about
those entities would be relatively easy. Thus, the information from the disambiguation
system was used as input to the biography generation system. The lists representing the
input received are provided in Appendix A.
C. PROCEDURE 


Each version of the system was written in Java and developed using the Eclipse 
IDE. Though most data structures were custom-built, some relied heavily upon the
built-in Java regular expression, input/output, and hash classes.


When working with the data from the Fair Isaac system, several pre-processing
steps were necessary before the work of information extraction could begin. To
pre-process the input, several data files were created to allow for easier handling. Since
the corpus being used was a static collection of news articles and had already been
processed by a named entity recognition (NER) system, several categories of data were
considered “closed” (i.e., the world consisted only of what the NER system identified).
These closed data categories included titles, organizations, and locations.


The members of these different categories were each placed in separate files and 
loaded into the biography summarization system as arrays of patterns. Meanwhile, 
several classes were developed to handle the data and perform the work of 
summarization. A diagram of the sub-system developed for pre-processing the data is 


provided in Figure 4. 


[Diagram: the ReferenceCounter and Ordering classes together with the data files
UniqueEntities.xml, References.txt, References_Sorted.txt, and Entity_Content.txt.]

Figure 4. Input pre-processing subsystem.


The ReferenceCounter class processes the Fair Isaac system’s output, 
UniqueEntities.xml. Specifically, it processes the lists of mentions for each entity. If an 
entity does not reach a chosen minimum number of mentions, then it is discarded. The class
creates a new file to contain the new list of entities and adds some information, such as 
the entity’s name (according to Fair Isaac), markers denoting the beginning and end of 


mention references, and a count of how many entities were returned. 


As stated above, the output of the disambiguation process is a long list of entities 
and their references. The disambiguated list originally contains 55,695 persons.4 Of that
number, 46,498 persons are only mentioned once while only 125 are mentioned more 
than 100 times. For the purposes of this research, only content about persons mentioned 
in the articles 100 times or more was compiled. The reasoning behind this stemmed from 
the desire to avoid a sparse data problem and will become more apparent when the output 


is discussed later in the chapter. 
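
The cutoff described above can be sketched in a few lines of Java. This is only an illustration: the map-based representation and the names used here are assumptions, and the real ReferenceCounter reads its data from UniqueEntities.xml rather than from an in-memory map.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the mention-count cutoff; not the original ReferenceCounter.
final class MentionThreshold {
    // Keeps only entities whose mention list has at least minMentions entries.
    static Map<String, List<Integer>> filter(
            Map<String, List<Integer>> mentionsByEntity, int minMentions) {
        Map<String, List<Integer>> kept = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> entry : mentionsByEntity.entrySet()) {
            if (entry.getValue().size() >= minMentions) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }
}
```

With minMentions set to 100, this corresponds to keeping only persons mentioned 100 times or more.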
While ReferenceCounter will print the references for each entity in numeric order, 


the listing of entities is still sorted alphabetically. The Ordering class puts the entities in 


4 From this point on, the reference to persons will technically refer to a person entity.

numeric order to make searching the corpus easy. Searcher does just as its name implies
by sweeping through the corpus to find PERSON tags. Since all PERSON tags in
PakNews have a corresponding identification number, and since these numbers are in
numeric order throughout the corpus, the list of entity reference mentions can be
processed at the same time as the corpus. If a PERSON tag contains an identification
number that is currently being looked for, then the enclosing sentence is added to the
owning entity’s block of information. The final output from this step is passed to each
version of the system.
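
Because both the wanted mention IDs and the IDs in the corpus are in ascending order, the search can be done in a single pass. The following Java sketch shows one way this could work; the parallel-array input and all names are assumptions made for illustration, not the original Searcher class.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical single-pass search; not the original Searcher implementation.
final class SearcherSketch {
    private static final Pattern ID = Pattern.compile("<PERSON ID=\"(\\d+)\"");

    // wantedIds and owners are parallel arrays sorted by ascending mention ID;
    // owners[i] names the entity that mention wantedIds[i] belongs to.
    static Map<String, List<String>> collect(
            List<String> sentences, int[] wantedIds, String[] owners) {
        Map<String, List<String>> blocks = new LinkedHashMap<>();
        int next = 0; // index of the next wanted mention ID
        for (String sentence : sentences) {
            Matcher m = ID.matcher(sentence);
            while (m.find() && next < wantedIds.length) {
                int id = Integer.parseInt(m.group(1));
                // Skip wanted IDs that the corpus has already passed.
                while (next < wantedIds.length && wantedIds[next] < id) {
                    next++;
                }
                if (next < wantedIds.length && wantedIds[next] == id) {
                    blocks.computeIfAbsent(owners[next], k -> new ArrayList<>())
                          .add(sentence);
                    next++;
                }
            }
        }
        return blocks;
    }
}
```

Note that a sentence containing two wanted mentions of the same entity is added twice in this sketch, which is one way redundancy can enter an entity's content block.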


While each version of the system is concerned with matching keywords in the text 
pertaining to each entity, they each go about it in slightly different ways. Each version, 
though, does make final decisions about what to declare as an entity’s job title, 
organization, and location based upon the argmax over the counts of the elements 


(explained below). Table 1 presents an overview of each system. 





Version Number | Approach
1 | Primitive keyword matching. No cue words or word locations were taken into account. No redundancy removal.
2 | Keyword matching with attention to word location and cue words. Wholly discarded sentences that were thought to be quotations. No redundancy removal.
3 | Keyword matching with attention to word location and cue words. Used basic information (job title, organization, location) found in quotations. Decided city based upon country.
4 | Keyword matching with attention to word location and cue words. Modified name filter to increase precision. Excluded false familial phrases such as “father of the nation” from the familial filter.

Table 1. Overview of different versions.

1. System Iterations


a. Version 1 


During processing, the persons’ references and their content were held in a 
large array of Person objects. The overall system algorithm dictated an approach that 
iterated through the array of Person objects and ran each of the filters below on each of 
the sentences in the current person’s content block from PakNews. Once the filtering of 
information was complete for all of the sentences, decisions about an entity’s name, title, 
organization, and location were completed and stored before moving to the next entity. 


The sentence filters and the decision processes are now discussed in more detail: 


Entity Name Filter. A simple name composed of the entity’s first and last
name was obtained from the Fair Isaac Entity Disambiguation System. This name is the
basis for a pattern of what the system will look for in the sentences which mention the
entity, $S_e$, and is also expanded upon to look for any additional middle names. For each
mention, $m$, of an entity, $e$, found in $S_e$, we can compute a count of the occurrences of $m$:

$$\mathrm{count}(m) = \sum_{m \in S_e} 1$$

Each mention in $S_e$ that does not correspond to a previously discovered
mention is added to a list, $L_m$, which represents the possible names that could be assigned
to an entity. For instance, U.S. President George Bush could rightfully be mentioned in
text as G.W. Bush, G. Bush, George W. Bush, etc. Formally, the list is defined as the
following:

$$L_m = \{\, m \mid (m = e) \land (m \in S_e) \land (m_i \neq m_j) \,\}$$

To decide which version or format of a name is correct for a
particular entity, we take the argmax over the counts of the mentions in $L_m$.

$$\mathrm{Name}_e = \arg\max_{m \in L_m} \mathrm{count}(m)$$

The decision reached ($\mathrm{Name}_e$) is the final name published in the biographical summary.
Note that this approach does not take into account misspellings in the name or variations
of the same name. Instead, it treats each first and last name pairing as unique.
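
The count and argmax steps described above can be sketched in Java as follows. This is a minimal illustration, not the original code: the class name is hypothetical, and breaking ties in favor of the candidate seen first is an assumption, since the text does not say how ties were resolved.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical tally-and-argmax helper; not taken from the original system.
final class ArgmaxTally {
    // count(x): how many times each candidate string was observed.
    static Map<String, Integer> tally(List<String> observations) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String s : observations) {
            counts.merge(s, 1, Integer::sum);
        }
        return counts;
    }

    // Argmax over the counts; ties go to the candidate seen first (an assumption).
    // Returns null when there are no candidates.
    static String argmax(Map<String, Integer> counts) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }
}
```

The same two steps apply unchanged to titles, organizations, and locations, since each of those decisions is also an argmax over counts.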


Entity Title Filter. Before filtering sentences, we collect all of the possible
titles from the corpus. Again, the Fair Isaac system allows us to bootstrap our
implementation by making determinations for us about what is a title and what is not.
After creating a list of titles, we look for those instances when a title, $t$, is used in the
entity’s content $S_e$. A running tally (computed below) of how many times each title is
found is stored for later analysis.

$$\mathrm{count}(t) = \sum_{t \in S_e} 1$$

Each title and count pairing is stored in a list, $L_t$, defined as:

$$L_t = \{\, t \mid (t = e_t) \land (t \in S_e) \land (t_i \neq t_j) \,\}$$

To arrive at a decision of what title to assign to a person, we again take the argmax over
the counts of the titles collected for that person.

$$\mathrm{Title}_e = \arg\max_{t \in L_t} \mathrm{count}(t)$$

The final title decision is then published in the final biographical summary.


An interesting note about titles in the PakNews corpus is that Arabic titles 
sometimes imply a deeper meaning than just a person’s position. For example, the title 
Maulvi is used to denote a Sunni Muslim religious leader. Thus, from a title we may be 
able to initially deduce a person’s religious orientation and in the example above, the 
specific sect of religion adhered to. In a non-Arabic context, a title such as
“Commander-in-Chief” may be a reference to the President of the United States and
provide information about a person’s range of influence and their nationality.


Entity Organization Filter. Similar to the titles from the PakNews corpus,
all possible organizations are identified and compiled into a single list. Each organization
name is then searched for in the set of sentences pertaining to the entity, defined as $S_e$.
When an organization, $o$, is located in the content, it is added to a list, $L_o$, of possible
organizations to which the current person may belong.

$$L_o = \{\, o \mid (o = e_o) \land (o \in S_e) \land (o_i \neq o_j) \,\}$$

If an organization is encountered a second time, a count for that particular organization is
incremented. Formally,

$$\mathrm{count}(o) = \sum_{o \in S_e} 1$$

To then determine an entity’s organization affiliation, we take the argmax over the
counts for each organization and assign the one with the highest count to the entity.

$$\mathrm{Organization}_e = \arg\max_{o \in L_o} \mathrm{count}(o)$$

This decision is published in the final biography as the organization choice for a
particular entity.


Entity Location Filter. Locations in the PakNews corpus are tagged as
such. Again, a listing of these locations can be compiled and searched for in each entity
block, $S_e$. In the case of locations, however, cities and countries show up as a pair of
strings separated by a comma (e.g., Islamabad, Pakistan). So, instead of deciding the
location as a unit including both city and country, the location string is split in two. As
before, the cities and countries found in the relevant content are associated with a person
as possible locations. A count is maintained for each location (i.e., city or country)
found. For cities,

$$\mathrm{count}(lci) = \sum_{lci \in S_e} 1$$

For countries,

$$\mathrm{count}(lco) = \sum_{lco \in S_e} 1$$

Each possible city location is added to a list of cities, $L_{lci}$, while each possible country
location is added to a list, $L_{lco}$:

$$L_{lci} = \{\, lci \mid (lci = e_{lci}) \land (lci \in S_e) \land (lci_i \neq lci_j) \,\}$$

$$L_{lco} = \{\, lco \mid (lco = e_{lco}) \land (lco \in S_e) \land (lco_i \neq lco_j) \,\}$$

Then, we take the argmax over the counts of the cities and the countries and arrive at an
independent decision for each.

$$\mathrm{Location}_{ci} = \arg\max_{lci \in L_{lci}} \mathrm{count}(lci)$$

$$\mathrm{Location}_{co} = \arg\max_{lco \in L_{lco}} \mathrm{count}(lco)$$

This allows the system to come to a decision about the city independently of its decision
about the country.
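
A minimal Java sketch of the split-and-decide step follows. The names are hypothetical, and skipping location strings that lack a comma is an assumption made only for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of independent city and country decisions.
final class IndependentLocation {
    // Returns {city, country}, each chosen by its own highest count.
    static String[] decide(List<String> taggedLocations) {
        Map<String, Integer> cities = new LinkedHashMap<>();
        Map<String, Integer> countries = new LinkedHashMap<>();
        for (String location : taggedLocations) {
            String[] parts = location.split(",", 2);
            if (parts.length < 2) {
                continue; // assumption: ignore strings without a comma
            }
            cities.merge(parts[0].trim(), 1, Integer::sum);
            countries.merge(parts[1].trim(), 1, Integer::sum);
        }
        return new String[] { argmax(cities), argmax(countries) };
    }

    // Highest count wins; ties go to the entry seen first (an assumption).
    private static String argmax(Map<String, Integer> counts) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }
}
```

Because the two decisions are independent, the chosen city need not lie in the chosen country. For example, two mentions of “Kabul, Afghanistan” alongside single mentions of three different Pakistani cities produce the city Kabul paired with the country Pakistan.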


Entity Quotation Filter. The most basic approach to capturing quotations 
was based upon matching the double quote character (“). The implication of doing this 
was that words placed inside quotations for a purpose other than quotation were captured 
and direct quotes that were not in quotations were not captured. To combat this, 
additional expressions were added to look for clue words such as “said” or “says” that 
indicate the presence of at least an indirect quote. Since quotes in news articles are said
by individuals with their own agenda, the information they contain cannot necessarily be
treated as fact. Eliminating facts contained in quotations is slightly troublesome,
however, because the first-hand account of someone who witnessed a terrorist attack
performed by an entity would be ignored simply because it appears as a quotation. This
risk is reduced somewhat by the high volume of redundancy in PakNews. Quotations
are still valuable for determining an entity’s frame of mind and can be useful to a
dossier. For instance, the only evidence of an entity’s intention to commit a terrorist
attack may be contained in a quotation. This possibility makes their inclusion in
the final biographical summary essential.

Entity Familial Relationship Filter. This filter seeks to uncover the names
of spouses, children, descendants, and ancestors. To find these relationships, keywords
and phrases such as “father,” “father of,” “mother,” and “in-law” are searched for in the
source text. If such keywords are discovered, the sentence in which they were embedded
is extracted and made a part of the entity’s final biographical summary.


Entity Professional Relationship Filter. Meetings, conferences, and
debates with other persons and organizations characterize professional relations. A
less common professional relation exists between persons and organizations when the
former lectures or gives a speech to the latter. This is a weaker relationship because an
organization is made up of many members, all of whom may or may not hold certain
things in common with the person. To find information about professional relationships,
keywords like “meeting,” “met with,” and “addressed” are searched for in the source text.
As when searching for familial relationships, if keywords are discovered, then the entire
sentence is extracted and added to the entity’s biographical summary.


Entity Lifespan Filter. The lifespan filter deals with the birth and death
dates of an entity. It can also collect the dates of the articles, providing a fixed reference
point in time on which to base judgments about relative temporal information in the
article (e.g., last Thursday, yesterday, etc.). Though the parsing of temporal information
does not take place in any version of the system, strings matching a date format (e.g.,
5-12-1980; May 12, 1980; etc.) and keywords referring to life and death such as “born,”
“died,” and “birth” were searched for in the relevant source text. Again, sentences
containing the keywords are extracted and added to the final biographical summary of
the individual.
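
The date formats and keywords named above could be matched with regular expressions along the following lines. The patterns cover only the two example formats given in the text, and the class name and the exact expressions are assumptions rather than the original filter.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the lifespan matching; the patterns are assumptions.
final class LifespanFilter {
    // Numeric dates such as 5-12-1980.
    private static final Pattern NUMERIC_DATE =
            Pattern.compile("\\b\\d{1,2}-\\d{1,2}-\\d{4}\\b");
    // Written dates such as May 12, 1980.
    private static final Pattern WRITTEN_DATE = Pattern.compile(
            "\\b(January|February|March|April|May|June|July|August|September"
            + "|October|November|December)\\s+\\d{1,2},\\s+\\d{4}\\b");
    // Keywords referring to life and death.
    private static final Pattern KEYWORD =
            Pattern.compile("\\b(born|died|birth)\\b", Pattern.CASE_INSENSITIVE);

    static boolean hasDate(String sentence) {
        return NUMERIC_DATE.matcher(sentence).find()
                || WRITTEN_DATE.matcher(sentence).find();
    }

    static boolean hasKeyword(String sentence) {
        return KEYWORD.matcher(sentence).find();
    }
}
```

Keeping the date test separate from the keyword test leaves open whether a sentence is extracted on either signal or only on keywords, which the description above does not fully settle.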


b. Version 2 


The initial results received from version 1 prompted the modification of
several filters to adjust the amount of information that was being collected by the system.
The decision processes remained the same, but the content being fed into those processes
was changed by modifying the filters. The changes to the affected filters are described
below:

Entity Title Filter. Instead of simply looking for all co-occurrences of titles
with the entity’s name, the filter only captures the title if it appears immediately before
the first or last name of the entity. This allows the filter to have some concept of context
and also allows for consideration of word ordering. In broad terms, word ordering refers
to the syntactic structure of a sentence (i.e., the order of the words). As titles from the
global list were found in the source text, they were added to a list of possible titles. As in
version 1, we then take the argmax over the counts of the titles and choose the resulting
string as the most likely title of the entity. This title appears in the final biographical
summary of the entity.
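
The positional constraint can be expressed as a regular expression that requires the title to be followed directly by the first or last name. The sketch below is one possible reading of that rule; the names are hypothetical and the original filter may have been written differently.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the version 2 title rule.
final class TitleBeforeName {
    // Counts a title only when it directly precedes the first or last name.
    static Map<String, Integer> tally(
            List<String> sentences, List<String> titles, String first, String last) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String title : titles) {
            Pattern p = Pattern.compile("\\b" + Pattern.quote(title) + "\\s+("
                    + Pattern.quote(first) + "|" + Pattern.quote(last) + ")\\b");
            for (String sentence : sentences) {
                Matcher m = p.matcher(sentence);
                while (m.find()) {
                    counts.merge(title, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

A title that merely appears in the same sentence as the name, as in “The Minister praised Musharraf,” is not counted under this rule.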


Entity Quotation Filter. While the original approach insisted on collecting 
quotations and assigning them to the source, this proved extremely difficult to evaluate 
and made the system fairly inefficient. Further, it could be argued that quotations reflect 
opinion and therefore cannot be trusted as sources of truth of any kind. For these reasons, 
the quotation filter became a litmus test for sentences. If a sentence contained a
quotation, then that was noted and the sentence was skipped.
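
A quotation test of this kind might look like the following Java sketch. It combines the double-quote match with the cue words “said” and “says” mentioned for version 1; the class name and the exact expressions are assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the quotation litmus test.
final class QuotationFilter {
    // Straight or curly double quotation marks.
    private static final Pattern QUOTE_MARK = Pattern.compile("[\"\u201C\u201D]");
    // Cue words that hint at an indirect quote.
    private static final Pattern CUE =
            Pattern.compile("\\b(said|says)\\b", Pattern.CASE_INSENSITIVE);

    // True when the sentence contains a double quotation mark.
    static boolean isDirect(String sentence) {
        return QUOTE_MARK.matcher(sentence).find();
    }

    // True when the sentence looks like a direct or an indirect quotation.
    static boolean isQuotation(String sentence) {
        return isDirect(sentence) || CUE.matcher(sentence).find();
    }
}
```

In version 2 a sentence for which isQuotation returns true would simply be skipped before the other filters run.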


c. Version 3 


On the system’s third iteration, some efficiency issues were addressed.
Classes were written for Organizations and Locations, which became wrappers for the
arrays of patterns and strings that had been used previously. Instead of using an array of
Persons to hold information during execution, one reusable Person object was
created and its content was printed to the output file after the current person was
processed. There were also changes to different filters, and the information collected was
processed by a function that removed redundancy (described in the next section).


Entity Name Filter. Some names provided by the entity disambiguation 
system were composed of only one name as opposed to a first and last name. Thus, the 
words before and after the given name were examined to see if any dominant name 
emerged. The collection and decision process were unchanged from version 2 of the 


system. 




Entity Organization Filter. In the list of organizations recovered from the
entity disambiguation system’s output, organizations were paired with their
corresponding acronym. Previously, just the organization name was searched for, but
this was changed to look for the organization’s name, its acronym, or both. When
deciding the organization, the instances of the organization’s acronym were attributed to
the score of the organization’s full name.
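
Folding acronym matches into the count of the full name can be sketched as below. The map from full name to acronym and all identifiers are assumptions for illustration; the organization names in the usage note are sample data only.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the version 3 organization tally.
final class OrganizationTally {
    // acronymsByName maps a full organization name to its acronym.
    // Every match of either form is credited to the full name.
    static Map<String, Integer> tally(
            List<String> sentences, Map<String, String> acronymsByName) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, String> org : acronymsByName.entrySet()) {
            Pattern p = Pattern.compile("\\b(?:" + Pattern.quote(org.getKey())
                    + "|" + Pattern.quote(org.getValue()) + ")\\b");
            for (String sentence : sentences) {
                Matcher m = p.matcher(sentence);
                while (m.find()) {
                    counts.merge(org.getKey(), 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

In this sketch a sentence such as “The Pakistan Muslim League (PML) met in Lahore.” adds two to the count for the full name, once for the name and once for the acronym.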


Entity Quotation Filter. In the previous two versions, if a sentence was
found to be either a formal or an informal quotation, it was ignored. The reasoning for
this was based on not wanting to pollute any decisions made with opinion. The impact
of ignoring these quotations was examined, however, and it was discovered that
including them in the data used to decide the person’s name, job title, organization, and
location was beneficial.


Entity Location Filter. Until this version, the entity location filter decided 
the country and the city of an entity independently. As locations were 
encountered in the text, the tally was still calculated as before. 

For cities, 

$$\mathrm{count}(l_{ci}) = \sum_{l_{ci} \in S} 1$$ 

For countries, 

$$\mathrm{count}(l_{co}) = \sum_{l_{co} \in S} 1$$ 

These locations were then compiled into two lists, $L_{l_{ci}}$ and $L_{l_{co}}$, defined 
above. In version 3, though, the functionality was merged to base the decision of the 
most likely city upon the decision made about the most likely country. Once the most 
likely country was determined, the most likely city within that country was chosen to 
complete the person’s most likely location. Formally, 

$$Location_1 = \arg\max_{l_{co} \in L_{l_{co}}} \mathrm{count}(l_{co})$$ 

$$Location_2 = \arg\max_{l_{ci} \in L_{l_{ci}} \cap C} \mathrm{count}(l_{ci})$$ 

where $Location_1$ is the country slot of the entity biography, $Location_2$ is the city 
slot of the entity biography, $l_{co}$ is each unique country encountered, $l_{ci}$ is each 
unique city encountered, and $C$ is the set of all possible cities that are located in 
the chosen country. 
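The country-first decision can be sketched as follows. This is a minimal illustration and not the thesis code: the gazetteer `CITIES_BY_COUNTRY`, the function name, and the tie-breaking rule (first encountered wins) are assumptions.

```python
from collections import Counter

# Hypothetical gazetteer standing in for C, the cities located in each country.
CITIES_BY_COUNTRY = {
    "pakistan": {"islamabad", "karachi", "lahore"},
    "india": {"new delhi", "mumbai"},
}

def decide_location(country_mentions, city_mentions, cities_by_country):
    """Pick the most frequent country, then the most frequent city inside it."""
    country_counts = Counter(country_mentions)
    if not country_counts:
        return None, None
    country = max(country_counts, key=country_counts.get)

    # Restrict the city tally to cities known to lie in the chosen country.
    allowed = cities_by_country.get(country, set())
    city_counts = Counter(c for c in city_mentions if c in allowed)
    city = max(city_counts, key=city_counts.get) if city_counts else None
    return country, city
```

Note that a city mentioned more often overall can still lose if it lies outside the chosen country, which is the intended effect of conditioning the city decision on the country decision.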
d. Version 4 


Some filters experienced minor changes to improve performance. 


Entity Name Filter. The name filter was made more precise by accounting for 
whitespace between words. Previously, the filter was looser and matched anything 
between the first and last name, which allowed “Abdullah” to be matched for “Abdul.” 
Also, instead of examining the word before and after one-word names, the filter now 
requires a simple space before and after the name to delimit it. 
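A minimal sketch of such a whitespace-delimited name pattern is given below. The helper name is hypothetical, and the use of lookarounds (so that the start and end of a string also count as delimiters) is an assumption that goes slightly beyond the literal "space before and after" described above.

```python
import re

def name_pattern(full_name):
    """Build a pattern that matches the name only as whole, space-delimited words."""
    parts = [re.escape(part) for part in full_name.split()]
    # (?<!\S) and (?!\S) require whitespace or a string boundary on each side,
    # so "Abdul" cannot match inside "Abdullah".
    return re.compile(r"(?<!\S)" + r"\s+".join(parts) + r"(?!\S)")
```

One consequence of this strictness is that a name immediately followed by punctuation (for example "Abdul,") is not matched, so punctuation would need to be stripped or tokenized beforehand.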


Entity Familial Relationship Filter. This filter was improved to exclude 
common phrases found within PakNews such as “father of a nation” from the indicative 


list of familial relationships. 
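The exclusion can be sketched as a stop-phrase pass that runs before the cue test. The cue list and the excluded phrases below are illustrative assumptions; only “son of,” “father of,” and “father of a nation” are mentioned in the text.

```python
# Hypothetical cue phrases and excluded vernacular phrases.
FAMILY_CUES = ("son of", "father of", "brother of", "daughter of")
EXCLUDED_PHRASES = ("father of a nation", "father of the nation")

def is_familial_sentence(sentence):
    """Return True if a familial cue remains after vernacular phrases are removed."""
    text = " ".join(sentence.lower().split())
    for phrase in EXCLUDED_PHRASES:
        text = text.replace(phrase, " ")
    padded = " " + text + " "
    return any(" " + cue + " " in padded for cue in FAMILY_CUES)
```

Removing the vernacular phrase first, rather than rejecting the whole sentence, keeps sentences that contain both an honorific and a genuine family relationship.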
2. Redundancy Removal 


When dealing with news articles, it is assumed that material will overlap from 
one document to the next. This redundancy subsequently appears in the information 
collected by the different sentence filters. During experimentation, three types of 
redundancy were encountered: duplication, encapsulation, and affirmation. Duplication, 
in which a complete sentence appears twice, is the most basic form of redundancy and 
the easiest to spot. When duplication was encountered, the second occurrence was 
removed from the final biography. Encapsulation occurs when part of one sentence is 
contained inside another sentence. To manage this case, the shorter sentence 
was removed and the longer sentence was preserved. Affirmation is the trickiest type of 
redundancy and was not addressed by this research. When one sentence affirms another, 
the information conveyed is the same, but the words used may be completely 
disjoint. This form of redundancy requires a more robust semantic framework to 
determine similarity and could be an independent research topic. 
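The two forms of redundancy that were handled can be sketched as below. This is an illustrative reconstruction under stated assumptions (case-insensitive comparison, whitespace normalization, and plain substring containment for encapsulation), not the thesis implementation.

```python
def remove_redundancy(sentences):
    """Drop exact duplicates, then drop sentences contained in a longer sentence."""
    seen = set()
    unique = []  # (normalized text, original sentence), first occurrence only
    for sentence in sentences:
        key = " ".join(sentence.lower().split())
        if key not in seen:          # duplication: keep the first occurrence
            seen.add(key)
            unique.append((key, sentence))

    kept = []
    for i, (key, sentence) in enumerate(unique):
        # encapsulation: the keys are distinct, so containment implies a longer sentence
        encapsulated = any(
            i != j and key in other_key for j, (other_key, _) in enumerate(unique)
        )
        if not encapsulated:
            kept.append(sentence)
    return kept
```

Affirmation is deliberately left out, since detecting it requires semantic similarity rather than string comparison.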


3. Output 


The original vision for the output was a simple text display of facts relevant to 
different categories of information about the entity. The final version of the output was 
formatted in XML (for future query-based work) and, for some categories, the entire 
relevant sentence was displayed. A sample of the type of output produced can be seen in 
Figure 5. 





<PERSON ID="12"> 
<NAME> 
<DECISION>Aftab Ahmed Khan Sherpao</DECISION> 
</NAME> 
<TITLES> 
<DECISION>Mr</DECISION> 
</TITLES> 
<LOCATION> 
<DECISION>Islamabad, Pakistan</DECISION> 
</LOCATION> 
<ORGANIZATION> 
<DECISION>Accountability Court</DECISION> 
</ORGANIZATION> 
<FAMILY> 
<FAM ID="1">Moreover, Sikandar Sherpao, son of 
Aftab Ahmed Khan Sherpao will start his 
parliamentary career from the NWFP assembly 
this time</FAM> 
</FAMILY> 
<ASSOCIATES> 
<REL ID="1">Shujaat informed that he would 
meet PML-Q allies including National Alliance, 
Aftab Sherpao and MQM</REL> 
</ASSOCIATES> 
</PERSON> 

Figure 5. Sample output from the system. 
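Because the output is XML, downstream tools can read the decided slots with a standard parser. The sketch below uses Python's `xml.etree.ElementTree` on an abbreviated, well-formed record; the tag names follow Figure 5, while the function name and the returned dictionary layout are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated, well-formed record using the tag names shown in Figure 5.
SAMPLE = """<PERSON ID="12">
  <NAME><DECISION>Aftab Ahmed Khan Sherpao</DECISION></NAME>
  <TITLES><DECISION>Mr</DECISION></TITLES>
  <LOCATION><DECISION>Islamabad, Pakistan</DECISION></LOCATION>
  <ORGANIZATION><DECISION>Accountability Court</DECISION></ORGANIZATION>
</PERSON>"""

def read_decisions(xml_text):
    """Map each slot tag to its DECISION text, plus the person's ID attribute."""
    person = ET.fromstring(xml_text)
    decisions = {"ID": person.get("ID")}
    for slot in person:
        decision = slot.find("DECISION")
        if decision is not None:
            decisions[slot.tag] = decision.text
    return decisions
```

Slots that hold whole sentences instead of a single decision (such as FAMILY and ASSOCIATES) are skipped by this sketch and would be read from their FAM and REL children.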


D. DETAILS OF CODE 


A partial listing of core classes is presented in Appendix B. The source code for 


the fourth and final version of the system can be found in Appendix C. 
IV. EVALUATION COMMENTARY 


A. EVALUATION OPTIONS 


Evaluation of automatic summarization applications continues to be an unsolved 
problem (Jing et al., 98). The reason for this is that summarization goals vary so widely 
that it is almost impossible to determine a universal evaluation for all systems. In 
general, however, summarization evaluation methods divide into two categories: intrinsic 
and extrinsic. 

An intrinsic evaluation is focused on evaluating the content of a summary. This 
includes its textual coherence and how useful the information in the summary is to the 
given task. Mani (01) outlines several intrinsic evaluation methods: 


a. Summary Coherence: Since most summarization system output is in 
natural language, either because the information is extracted or because 
some language generation module has been applied, readability of the end 
result is a necessary indicator of a worthwhile summary. Information can 
be extracted in such a way that dangling anaphors exist or the discourse 
structure of the original text is broken. The coherence of a summary is 
usually subjectively evaluated by humans who give summaries a grade on 
a scale determined by the researchers. 

b. Summary Informativeness: The summary of a source text can be 
completely coherent, yet lack substance of any sort. Thus, the information 
extracted must be analyzed based upon what role the summary is expected 
to fulfill. Informativeness, like coherence, can be judged subjectively, but 
it can also be scored according to the standard definitions of precision and 
recall. Recall in this case measures how many human-created reference 
summaries contain the same information as the machine-created summary. 
If the extraction algorithms are sound, a remaining obstacle to 
informativeness may be the amount of compression of the source 
document that was required. There is no acceptable way to compare the 
informativeness of variably compressed summaries. 

c. Comparison Against Input: This form of evaluation involves the 
informativeness of the summary, but can also encompass more than that. 
The options for performing this type of evaluation are semantic 
methods and surface methods. Using semantic methods requires that each 
sentence in a text be hand-tagged with a meaning. The summary can then 
be judged by how many topics from the original source it covers. A 
surface method can be performed by identifying key passages in the 
source text and then looking for the presence of that content in the 
summary. 


While intrinsic evaluation focuses on the actual summary, extrinsic 
evaluations are oriented around the ability of the summary to make a job easier for 
humans to perform (relevance assessment (Mani, 01); reading comprehension (Morris, 
92); etc.). Usually, the impact of a summary system on a job is a reduction in the time 
necessary to reach the conclusion one would have reached by reading the entire source 
document. In the TIPSTER SUMMAC evaluation (Mani, 01), which created indicative 
summaries for articles that were then sifted by government intelligence analysts, the 
“relevance assessment time [was reduced] by 40% ...to 50%, with no statistically 
significant degradation in accuracy.” 


The other primary use of extrinsic evaluation, reading comprehension, has 
humans read full-length documents or summaries and then answer multiple-choice 
reading comprehension questions. If users reading summaries are able to score as 
high as users reading the full document, then it can be concluded that the summary is 
highly informative. Similar experiments (Hovy et al., 98) have been conducted in which 
users must recreate the original source document based upon the summary. 


Broader objective and subjective methods exist to judge summaries as well. 
Objective methods for evaluating summarization output generally compare human 
summaries to ones generated autonomously. The most common and oldest method is to 
have human abstraction experts create summaries for the content in question and then 
compare the system’s summary to them based upon similarity of words, content 
selection, or other characteristics. This group of methods can be time-consuming and 
costly, however, especially if there is a large volume of content. 


More standard statistical evaluations such as calculating the precision, recall, and 
F-score also fall into this group of methods, though it is not always clear how to define 
them in this domain. Precision is generally defined as the number of correct items given 
by a system divided by the total number of items given by the system.⁵ In this definition, 
“items” is left somewhat ambiguous and the term must be quantified for an individual 
system before an evaluation can be initiated. The general formula for precision (P) is: 

$$P = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$$ 

Related to precision, recall is generally defined as the number of correct items 
provided by a system divided by the total number of correct items in the original text.⁶ 
The general formula for recall (R) is: 

$$R = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$$ 

Finally, the F-score is a measurement that balances precision and recall by taking 
the harmonic mean⁷ of the two. The general formula for the F-score (F) is: 

$$F = \frac{2 \cdot P \cdot R}{P + R}$$ 
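The three measures can be computed directly from the counts of true positives, false positives, and false negatives. The following is a minimal sketch; the function names and the convention of returning 0.0 for an empty denominator are assumptions.

```python
def precision(true_pos, false_pos):
    """Correct items returned divided by all items returned."""
    total = true_pos + false_pos
    return true_pos / total if total else 0.0

def recall(true_pos, false_neg):
    """Correct items returned divided by all correct items in the source."""
    total = true_pos + false_neg
    return true_pos / total if total else 0.0

def f_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0
```

For example, a system that returns 10 items, 8 of them correct, while missing 8 other correct items has a precision of 0.8 and a recall of 0.5; the harmonic mean pulls the F-score toward the lower of the two values.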
Subjective methods focus on assessing a summary’s informativeness and 
coherence. If this is to be done scientifically, it requires multiple reviewers who have 
established methods for rating the information contained in the summary and how well 
the information flows (applicable only when creating a summary in natural language). 
The problem with this set of methods, however, is that reviewers must be familiar with 
all of the content contained in the original article in order to know if all relevant 
information is contained in the summary. 

5 Jurafsky and Martin. Speech and Language Processing. Prentice Hall 2003, p. 578. 
6 Ibid. 
7 A discussion of the harmonic mean is found here: http://en.wikipedia.org/wiki/Harmonic_mean. 


Remembering that the biography summarization system is attempting to model an 
automatic dossier generator, there are only a few options when performing an evaluation. 
An intrinsic evaluation based upon sentence precision can be performed, while a more 
limited evaluation of sentence recall can also be performed to ensure that precision is not 
being improved at the expense of recall. Further, an extrinsic evaluation can be 
performed based upon the estimated time saved by generating the biographies 
automatically. 


Given the time constraints of this research, it was sufficient to examine 18 
biographies⁸ from the output of processing the 125 persons that were mentioned more 
than 100 times in PakNews. The output produced by each version of the system was 
examined and compared against facts that could be verified from the corpus. The 
comparative results for each version of the system can be seen in Figure 6. 

Figure 6. Chart Displaying Overall Intrinsic Evaluation Results. 

8 This accounts for almost 15% of the final output. 
B. INTERPRETATION OF RESULTS 


As can be seen in Figure 6, each of the four versions experienced most of the 
same peaks and troughs, meaning that for the most part, each version had similar 
difficulties and triumphs. However, with each iteration of the system, the performance 
curve shifted upwards, resulting in version four achieving the best results overall. The 


individual results of each version and the peculiarities observed will now be discussed. 
1. Version 1 


As detailed in Chapter III, version 1 of the system simply used term frequency to 
determine the person’s organization, title, and location. Term frequency is a general 
measure of how often a term is found in a collection of documents. It is widely utilized 
in information retrieval, most often to provide a weight for a certain term given a certain 
document. “There are several variants, but a common form of the governing equation is: 

$$w_{ij} = tf_{ij} \cdot \log_2 \frac{N}{n} \qquad (3)$$ 

where $w_{ij}$ is the weight of term $t_i$ in document $d_j$, $tf_{ij}$ is the frequency of term 
$t_i$ in document $d_j$, $N$ is the number of documents in the corpus, and $n$ is the number of 
documents in the corpus in which term $t_i$ occurs” (Mani et al., 01). In this research, the 
actual weights were not needed, but the frequency of each title, organization, and location 
in each entity’s block was a necessary piece of information to make a decision. 
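For reference, the weighting in equation (3) and the simpler frequency tally that version 1 relied on can be sketched as follows. The base-2 logarithm, the function names, and the list-of-terms input are assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf_weight(tf, n_docs, docs_with_term):
    """Weight of a term in a document, assuming a base-2 logarithm in equation (3)."""
    return tf * math.log2(n_docs / docs_with_term)

def most_frequent(terms):
    """Version-1 style decision: pick the term seen most often in an entity's block."""
    counts = Counter(terms)
    return max(counts, key=counts.get) if counts else None
```

A term that occurs in every document receives a weight of zero under equation (3), which is why the raw frequency, rather than the weight, is the useful quantity when all candidates come from a single entity's block.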


Moreover, a blind test for the presence of certain cue words like “meet” and 
“brother” triggered the system to extract the sentence and classify it as either associate or 
familial information, respectively. The results ranged from a high of 88.9% to a low of 
38.3%. The mean score was 61.2%, while overall system precision P was 
calculated as follows: 

$$P = \frac{\text{number of total correct inclusions}}{\text{total number of inclusions}} = 58.9\%$$ 
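The mean score and the overall precision differ because the first averages per-group percentages while the second pools all inclusions before dividing. The sketch below illustrates the distinction with hypothetical counts; the grouping of the scores and the numbers themselves are assumptions and are not the thesis data.

```python
def overall_precision(groups):
    """Pool all inclusions: total correct divided by total included."""
    correct = sum(c for c, _ in groups.values())
    total = sum(t for _, t in groups.values())
    return correct / total

def mean_precision(groups):
    """Average the per-group precision values, weighting every group equally."""
    return sum(c / t for c, t in groups.values()) / len(groups)

# Hypothetical (correct, total) counts per group, for illustration only.
EXAMPLE = {"name": (9, 10), "family": (15, 40)}
```

With these hypothetical counts the mean is 0.6375 but the pooled value is 0.48, because the weaker group contributes far more inclusions; the same effect would explain an overall precision sitting below the mean score.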
Figure 7. Precision of version 1. 


The results from version 1 spurred more research questions. The system decided 
country and city independently to calculate the location, yet the system decided the city 
correctly less than 6% of the time. Meanwhile, the correct country was chosen 76% of 


the time. What was causing the discrepancy? 


Also, the filters used to detect familial relationships in the text were not 
working well, as evidenced by only a 38% success rate. This result was largely 
explained by one person, Muhammad Ali Jinnah, who was 
known as the “Father of the Nation of Pakistan.” Though he was mentioned as such 
44 times, his actual family was mentioned only 15 times. Without his inclusion, the 
familial relationship filter of version 1 operated at approximately 46%. 


2. Version 2 


Version 2 of the system used term frequency with word ordering to determine 
organization, title, and location. Improvements were implemented in the familial 
relationship filter so that simple phrasal constructs such as “son of” and “father of” would 
be recognized. For version 2, the results spanned from 88.9% to 28.6%. The mean 
precision was 63.5% and the overall performance of the system was 59.2%. The biggest 
improvement of the system in this version was the taming of the city list, which was found 
to contain many different location errors, such as locations that did not actually exist. 
How this came to be is unknown, as these locations did not appear in the original 
PakNews corpus. After hand-correcting the list, the overall precision of the city decision 
algorithm was 35.3%. The familial filter still struggled, however, and actually decreased 
in performance to 35.3%. This was caused by the filter including more information, 
less of which was accurate about the person of interest. 

Figure 8. Precision of version 2. 


3. Version 3 


The third iteration of the system performed better than the first two. 
The largest improvement for this version was in the person names and job titles. 
The previous best for these two categories was 97% and 78%, respectively. Version 3 
pushed the name precision to 100%⁹ and increased the title precision to 93.8%. 

9 This should come as no surprise to the reader, since this is essentially given information from the 
entity disambiguation system. 

Figure 9. Precision of version 3. 


The graph of the performance of version 3 (Figure 9) does not look very 
dissimilar from that of version 2. In fact, version 3’s overall performance was 67%, an 
increase of only about 8 percentage points over version 2. Version 3’s results spanned 
from 36.2% to 88.9% with a mean precision of 69.4%. Deciding an entity’s location 
continued to be a problem. In version 2, the system’s precision was fairly high for 
deciding countries and extremely low for deciding cities. Version 3’s new method for 
determining location flattened the scores of each category and caused them both to 
average out around 50%. 
4. Version 4 


Version 4 scored the best of all of the versions and achieved a better score in all 
categories except for title and organization. However, since these filters were developed 
in a modular way, the better-performing filters from version 3 could easily be “plugged 
in” to version 4. The performance impact on the overall system’s precision was not 
statistically significant. What was significant was that ignoring vernacular phrases in the 
familial filter caused the precision of that filter to increase from 37% to 55%. Also 
significant was the increase in the precision of the location filters: the city filter 
exceeded 70% and the country filter exceeded 80%. The graph of the precision of this 
system shows that the balance between what to include versus what was correct about the 
entities was almost attained. 

Figure 10. Precision of version 4. 


Version 4’s results ranged from 42.8% to 100%, with a mean 
precision of 77.6%. The overall system performed at 80.4%, an increase of about 13 
percentage points over version 3. Because the ratio of correct inclusions to total 
inclusions was already very high, it was decided that any additional improvements to the 
system would yield only minimal gains and that the limits of keyword filtering had 
probably been reached. 




V. CONCLUSIONS AND FUTURE WORK 


A. CONCLUSIONS 


In this thesis, the implied standing assumption that a sufficient biography 
summarization system would have to be equipped with the latest natural language 
processing tools was challenged. Just as automatic summarization is a divergent branch 
from the main trunk of natural language processing research, so biography summarization 
is sufficiently different from regular summarization to warrant a fresh look at techniques 
being used for the research being done. In general, it is not sufficient to simply use rigid 


summarization approaches for biography summarization. 


Several lessons were learned from this research. First, even in a corpus as vast as 
PakNews, the large majority of entities will only be mentioned once. This shortage of 
final data complicates evaluation. However, several summarization projects were limited 
to a corpus of between 100 and 200 documents, so in this light the number of persons is 


adequate. 


Second, providing regular expressions that will fit every case contained in a 
corpus while attempting to preserve generality is difficult. Several expressions went 
through iterations of change as small subtleties in how they were being used came to 
light. For example, placing a space after the given first name of an entity made all the 
difference between selecting the name “Abdul” who was a Sheikh and “Abdullah” who 


was a Prince. 


Third, a statistical keyword filtering approach is sufficient to reach 80% 
correctness (i.e., precision). While a semantic parsing module may have taken the 
performance to near 100%, the amount of work required to achieve accurate parsing may 
not be justified in every situation. If a system were being developed for the DoD 
to perform the same type of work, it would need to be close to 100% accurate, so it would 
be prudent to implement a semantic parsing algorithm. 
B. FUTURE WORK 


While the basic goal of producing biographical summarizations was achieved, 
there were many subtleties of the research that did not come to fruition. First, it would 
have been desirable to produce a version of the summarization system that did rely on 
semantic interpretation of the text, in order to quantitatively show how much value this 
added in the final evaluation. The immediate obstacle for dealing with semantics was 


that the PakNews corpus had not previously been tagged with part-of-speech tags. 


Second, more filters could have been developed to deal with several other 
types of information. For example, methods have recently been detailed to 
extract and interpret temporal information (Bethard et al., 07). Given such information, a 
very robust timeline for the person could have been created, which would have allowed 
for determining whether the person was still living, when different family members died, 
etc. When working with temporal information, inference is also key to understanding the 
text. 


It would have also been profitable to perfect the nationality and lifespan filters. 
The major problem with these filters was sparse data from the corpus. Yet, it is possible 
that this would be expected since the news articles being processed are neither wholly 
about one person, nor are they encyclopedia articles about the person. The temporal 
filtering described above could have aided the lifespan filter inasmuch as knowing the 


earliest date an entity is mentioned and extrapolating their birth as being before that date. 


Third, the output produced could be useful in social networking applications and 
query-based applications. A front-end interface for the system was envisioned which 
would parse the XML output and present the user with the concise facts regarding a 
person. A map could be displayed based on the location information gathered. In fact, a 
map showing all locations at which an entity is mentioned could also be useful as a 
pictorial way to trace a person’s movements. In addition to the mapping possibilities, the 
potential for social networking visualization is substantial. Named associates of the 
current person can be placed on a graph or grid with each associate becoming a node. 
Users could then click nodes and navigate through a chain of persons to reveal degrees 
of separation between two persons. Finally, the XML backend data driving the interface 
could be completely indexed for quick data retrieval and to facilitate question answering. 
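As one possible illustration of the envisioned degrees-of-separation feature, the associates extracted for each person could be treated as an adjacency list and searched breadth-first. Nothing of this kind was implemented in this research; the data layout and function name below are hypothetical.

```python
from collections import deque

def degrees_of_separation(associates, start, goal):
    """Return the fewest associate links between two persons, or None if unconnected."""
    if start == goal:
        return 0
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        person, distance = queue.popleft()
        for neighbor in associates.get(person, ()):
            if neighbor == goal:
                return distance + 1
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None
```

Breadth-first search is chosen here because it visits persons in order of increasing distance, so the first path found to the goal is also the shortest.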


In the area of evaluation, several additional approaches could still be addressed. 
An evaluation could be performed over the difference between the time necessary to 
generate the biography by hand versus by machine. The hand-generated summaries could 
also serve as a basis to score the generated biography for correctness and 
informativeness. In addition to these relatively subjective evaluations, an evaluation 
could also be conducted over each version of the system based upon sentence recall. It is 
a statistical fact that precision can be increased at the expense of recall and vice versa. A 
full evaluation of the 18 entities examined in the current evaluation needs to be 
undertaken to see how recall was impacted by the decreasing information inclusions. 


In conclusion, the topic of automatic biographical summarization is a fascinating 
one that is both relevant and challenging in our current age. In addition to the scientific 
research that must still be conducted to determine methods to create nearly perfect 
biographies autonomously, philosophical issues of privacy and individual rights must be 
addressed as well. The current growth of information online that is publicly available 
and available for sale is leading our culture to an ever-increasing demolition of individual 
secrecy. Systems like the one described in this research must be developed with the 
safeguards of a will to use the information gleaned for mankind’s benefit and not for 


personal gain. 
Senior Vice Chairman 


Senior Vice Foreign Minister 
Senior Vice Foreign Minister Mr 
Senior Vice Minister 

Senior Vice President 

Senior Vice President Doctor 
Senior Vice President Haji 
Senior Vice President Sheikh 
Sergeant 

Sergeant First Class 
Sergeant Major 

Serviceman 

Shah Imam 

Shaheed 

Shaheed Captain 

Shaheed Doctor 

Sheikh 

Sheikh Doctor 

Sheikh Hazrat 

Sheikh Justice 

Sheikh Sultan 

Sheriff 

Shortstop 

Sir 

Sir Sayyid 

Sir Staff General 

Sister 

Skipper 

Skipper Captain 

Solicitor General 

Speaker 

Speaker Mr 

Speaker Sayyid 


Spokesman 

Spokesman Brigadier 
Spokesman Brigadier General 
Spokesman Captain 
Spokesman Colonel 
Spokesman Commander 
Spokesman Doctor 
Spokesman General 
Spokesman Interior Secretary 
Spokesman Lieutenant 
Spokesman Lieutenant Colonel 
Spokesman Major 
Spokesman Major General 
Spokesman Mr 

Spokesman Mr Foreign Minister 
Spokesman Mullah 
Spokesman Professor 
Spokesman Rear Admiral 
Spokesman Sayyid 
Spokesman Sheikh 
Spokesman Sultan 
Spokesperson 

Spokesperson Major General 
Spokeswoman 
Spokeswoman Captain 
Spokeswoman Major 
Squadron Leader 

Staff Admiral 

Staff Admiral Sir 

Staff General 

Staff General Sir 


Staff Lieutenant General 


Staff Lieutenant General Sayyid 
Staff Major General 
Staff Sergeant 
Staff Sir 
Striker 
Sultan 
Sultan Mian 
Sultan Prince 
Superintendent 
Supervisor 
Supreme Commander Mullah 
Supreme Leader Ayatollah 
Supreme Leader Maulana 
Supreme Leader Mullah 
Technical President 
Technology Advisor 
Technology Doctor 
Technology Lieutenant Colonel 
Technology Minister 
Technology Minister Doctor 
Technology Minister Professor 
Technology President Professor 
Technology Prime Minister Professor 
Technology Professor 
Technology Professor Doctor 
Telecommunications Minister 
Telecommunications Mr 
Terrorist 
Trade Minister 
Trade Minister Baroness 
Trade Minister Mr 
Trade Mr 
Trade Mrs 

Trade Representative 
Trade Representative Ambassador 
Trade Representative Mr 
Treasurer 

Treasurer Doctor 
Treasury Secretary 
Treasury Secretary Mr 
Treasury Undersecretary 
Umpire 

Undersecretary 

Vice Admiral 

Vice Captain 

Vice Chairman 

Vice Chairman Mian 
Vice Chairman Sayyid 
Vice Chancellor 

Vice Chancellor Professor Doctor 
Vice Chief Brigadier 
Vice Foreign Minister 
Vice Foreign Minister Mr 
Vice Marshal 

Vice Minister 

Vice Premier 

Vice Premier Doctor 
Vice Premier Mr 

Vice President 

Vice President Chairman 
Vice President Doctor 
Vice President Haji 

Vice President Mian 


Vice President Minister 
Vice President Mr 
Vice President Qazi 
Vice President Sayyid 
Vice President Sheikh 
Vice Prime Minister 
Wicketkeeper 
Wicketkeeper-batsman 


Winger 


ORGANIZATIONS 


accountability bureau|ab 
accountability court|ac 
aga khan development network|akdn 
aga khan education service|akes 
aga khan foundation|akf 
aga khan rural support program|akrsp 
aga khan university|aku 
agricultural development bank|adb 
agricultural linkages|agricultural linkage 
agriculture department|agri department 
agriculture development bank|adb 
agriculture ministry|agri ministry 
agriculture research|agri research 
air chief marshal|acm 
air chiefs|air chief 
air forces|air force 
al qaeda|aq 
al qaida|aq 
all china women federation|acwf 
all india radio|air 
all pakistan mohajir students organization|apmso 
all pakistan newspapers employees confederation|apnec 
all pakistan newspapers society|apns 
all pakistan women association|apwa 
all parities conference|apc 
all parities hurriyat conference|aphc 
all party conference|apc 
all party hurriyat conference|aphc 
all party hurriyet conference|aphc 
allama iqbal open university|aiou 
allied bank limited|abl 
american business council|abc 
american centres|american centre 
american civil liberties union|aclu 
american israeli public affairs committee|aipac 
american muslim council|amc 
amnesty international|ai 
anjuman tajiran sindh|ats 
ansar burney welfare trust international|abwti 
ansar burney welfare trust|abwt 
anti narcotic force|anf 
anti terrorism court|atc 
app corporation|app corporation 
arab league|al 
argentina cricket association|aca 
army medical college|amc 
army public schools|army public school 
army strategic forces command|asfc 
army welfare trust|awt 
asean regional forum|arf 
asian cooperation|asian co 
asian cricket foundation|acf 
asian development bank|adb 
asian development fund|adf 
asian pacific economic conference|apec 
asian parliamentarians|asian parliamentarian 
asian squash federation|asf 
askari information systems|ais 
ataullah mengal|ata mengal 
auqaf department|auqaf dept 
australia cricket board|acb 
australian broadcasting corporation|abc 
australian cricket board|acb 
austrialia cricket board|acb 
awami action committee|aac 
awami league|al 
awami national conference|anc 
awami national party|anp 
ayub medical college|amc 
babri masjid action committee|bmac 
babri masjid movement coordination committee|bmmcc 
balochistan high court|bhc 
balochistan national movement|bnm 
balochistan national party|bnp 
balochistan olympic association|boa 
baluchistan assembly|baluch assembly 
baluchistan national party|bnp 
bank of china|boc 
bank of khyber|bok 
bank of punjab|bop 
banking department|banking dept 
banking services corporation|bsc 
bar associations|bar association 
barani area development project|badp 
berkeley nucleonic corporation|bnc 
bhabha atomic research center|barc 
bhabha atomic research centre|barc 
bharat sanchar nigam limited|bsnl 
bharatiya jana sangh|bjs 
bharatiya janata party|bjp 
bharatiya janta party|bjp 
bhartia janta party|bjp 
bhartyia janata party|bjp 
blue berrets|blue berret 
bolan medical college|bmc 
boxing champion|boxing champ 
boy scouts associations|boy scouts association 
british airways|ba 
british broadcast company|bbc 
british broadcasting corporation|bbc 
british council|bc 
bush administration|bush administration 
business software alliance|bsa 
business support centre|bsc 
cabinet committee on privatisation|ccop 
cable news network|cnn 
canadian international development agency|cida 
capital development authority|cda 
caribbean media corporation|cmc 
carrier telephone industries|cti 
central executive committee|cec 
central intelligence agency|cia 
central intelligence department|cid 
central police office|cpo 
central reserve police force|crpf 
central superior services|css 
chief election commission|cec 
china gold association|cga 
china national petroleum corporation|cnpc 
china radio international|cri 
china software industry association|csia 
cholistan development authority|cda 
citizen community board|ccb 
citizen peace committee|cpc 
citizens peace committee|cpc 
civil aviation authority|caa 
clipsal pakistan limited|cpl 
coalition information center|cic 
commonwealth development corporation|cdc 
commonwealth ministerail action group|cmag 
commonwealth science council|csc 
community welfare attache|cwa 
comprehensive economic partnership|cep 
comsats institute of information technology|comsat institute of information technology 
comstech information technology centers|comstech information technology center 
congressional budget office|cbo 
counsel generals|counsel general 
cricket world cup|cwc 
criminal appeals|criminal appeal 
ctv national news|cnn 
cultural counsellor|cultural co 
defence consultative group|dcg 
defence division|d division 
defence intelligence agency|dia 
defence production|dp 
defence research development organisation|drdo 
defense consultative group|dcg 
defense housing authority|dha 
democratic freedom party|dfp 
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democratic liberation partyldlp 

democratic national committeeldnc 
democratic political movementldpm 
department for international developmentldfid 
department of defenceldod 

department of defenseldod 

department of justiceldept of justice 

deutche welleldw 

deutsche welleldw 

dhaka international trade fairlditf 
diphosphate amonium phosphateldap 

dir area support projectldasp 

directorates of intelligenceldirectorate of intelligence 
district cricket associationldca 

district management groupldmg 

district public safety commissionldpsc 
economic advisory board|eab
economic affairs division|ead
economic cooperation organisation|eco
economic cooperation organization|eco
economic coordination committee|ecc
economic coordination organization|eco
economic management group|emg
election commissioners|ec
election commissioners|election commission
election commissioners|election commission|ec
election commission|ec
election tribunals|et
electronic data systems|eds
electronic government directorate|egd
emerging sciences|emerging science
emirates cricket board|ecb
engineering development board|edb
england cricket board|ecb
environment protection agency|epa
environment protection department|epd
environmental protection agency|epa
environmental protection|environment protection
environmental tribunals|environment tribunals
eu council|ec
euorpean union|eu
european commission|ec
european commission|eu commission
european parliamentarians|european parliament
european parliament|ep
european parliament|eu parliament
european unions|european union|eu
european union|eu
export development fund|edf
export facilitation center|efc
farhad company|fc
farm machinery institute|fmi
fauji fertilizer company limited|ffcl
fauji fertilizer company|ffc
federal and provincial governments|federal and provincial government
federal council|fc
federal court|fc
federal government services hospital|fgsh
federal intelligence agency|fia
federal investigation authority|fia
federal investigative agency|fia
federal public service commission|fpsc
federal public services commission|fpsc
federal security bureau|fsb




federal services|federal service
federal shariat court|fsc
federal tax ombudsman|fto
federal trade commission|ftc
finance departments|finance department
finance department|finance dept|fd
financial action task force|fatf
financial advisors|fa
first information reports|fir
first information report|fir
food department|food dept
foreign currency exchange services|fces
foreign direct investment|fdi
foreign ministry|fm
foreign office|fo
foreign services|foreign service
foreign trade ands export corporation|foreign trade and export corporation
forest department|forest dept
gawadar development authority|gda
general electric|ge
general motors|gm
general sales tax|gst
ghazi barotha hydro power project|gbhpp
ghazi barotha taraqiati idara|gbti
global broadcast network|gbn
global change impact studies centre|gcisc
global information technology|git
global water partnership|gwp
government medical college|gmc
government of india|goi
governments of india|government of india
gulf cooperation council countries|gccc
gulf cooperation council|gcc
habib bank limited|hbl
habib exchange company|hec
hajj policy|haj policy
hajj visa|haj visa
hamriyah free zone|hfz
haq parast group|hpg
hattar industrial estate|hie
hazara qaumi mohaz|hqm
head constable|hc
high commissioners|high commission
high commissioners|high commission|hc
high commission|hc
high court bar association|hcba
high court|hc
hindustan times|ht
house building finance corporation|hbfc
house of commons|house of common
house of lords|house of lord
human development forum|hdf
human development foundation|hdf
human development fund|hdf
human resource development network|hrdn
human rights bureau|hrb
human rights forum|hrf
human rights foundation|hrf
human rights watch|hrw
hunza development forum|hdf
hurriyat conference|hc
hurriyet conference|hc
illinois national guardsman|illinois national guard
independent press association|ipa
indus river system authority|irsa
indus water commissioners|indus water commission
indus water commission|iwc
ineternational cricket council|icc
information communications technology|ict
information technology commerce network|itcn
institute of business management|iobm
intelligence bureau|ib
intelligence burueau|ib
inter service intelligence|isi
inter services intelligence)|isi
inter services intelligence|isi
international air transport association|iata
international air travel association|iata
international airlines transport association|iata
international atomic energy agency|iaea
international boxing federation|ibf
international civil aviation organization|icao
international cricket conference|icc
international cricket council|icc
international criminal court|icc
international development association|ida
international finance corporation|ifc
international human rights commission|ihrc
international islamic university|iiu
international labor organization|ilo
international labour organisation|ilo
international labour organistaion|ilo
international labour organization|ilo
international maritime bureau|imb
international maritime organization|imo
international monetary fund|imf
international monitoring fund|imf
international olympic committee|ioc
international olympic council|ioc
international olympics committee|ioc
international political forum|ipf
international relations committee|irc
international rescue committee|irc
international security forces|isf
international watch organization|iwo
international women moot|iwm
ipu) council|ipu council
islami jamhoori ittehad|iji
islamic society of statistical sciences|isoss
islamic students league|isl
israel defense forces|idf
israeli defence forces|idf
jamhoori watan party|jwp
jammu and kashmir national panthers party|jaknpp
jamshoro power company|jpc
japan cultural associations|japan cultural association
japan emergency ngos|jen
jawaharlal nehru sports trust|jnst
jeay sindh qaumi mahaz|jsqm
jeay sindh qaumi movement|jsqm
jeay sindh quami mahaz|jsqm
jet propulsion laboratory|jpl
jinnah postgraduate medical center|jpmec
jiye sindh qaumi mahaz|jsqm
justice department|justice dept
kahota research laboratories|krl
kahuta research laboratories|krl
karakuram international university|kiu
kashmiri american council|kac
kashmiri scandinavian council|ksc
khan research laboratories|krl
khushal pakistan program|kpp
khushali bank|kb
khushhali bank|kb
khyber medical college|kmc
kinetic warhead|kw
king edward medical college|kemc
kings party|king party
korean broadcasting system|kbs
kuwait petroleum company|kpc
kyrgyzstan caa|kyrgyz caa
land acquisition committee|lac
latif akbar|la
layton rehmatulla benevolent trust|lrbt
legislative assembly|la
lever brothers pakistan limited|lbpl
lexus gss|lexus gs
liberal democratic party|ldp
liberal forum pakistan|lfp
library of congress|loc
lions clubs|lions club
liquefied petroleum gas|lpg
local area network|lan
local governments|local government
local government|lg
loya jirga commission|ljc
management development services|mds
marine rescue coordination center|mrcc
maritime security agency|msa
marylebone cricket club|mcc




mas freedom foundation|mas freedom foundation
mehsuds of badar|mehsud of badar
merv dillon|m dillon
metallurgical construction company|mcc
microsoft corporation|microsoft corp
millat party|mp
milli yakjehti council|myc
ministry of communications|ministry of communication
ministry of defence|mod
ministry of health|moh
ministry of industries and productions|ministry of industries and production
ministry of information|ministry of i
ministry of women development|mowd
minority advisory council punjab|macp
mirpur development authority|mda
mna hostels|mna hostel
mohajir qaumi movement|mqm
mutaheda qaumi movement|mqm
mutahhida qaumi movement|mqm
mutahidda qaumi movement|mqm
muttaheda qaumi movement|mqm
muttahida jehad council|mjc
muttahida qaumi movement|mqm
muttahida quami movement|mqm
muttahida quomi movement|mqm
muttehida quami movement|mqm
national accountability bureau|nab
national agricultural research centre|narc
national agricultural research council|narc
national agricultural research systems|nars
national alliance;|national alliance
national alliancehave|national alliance
national awami party pakistan|napp
national basketball association|nba
national book foundation|nbf
national broadcasting corporation|nbc
national command authority|nca
national conference|nc
national construction limited|ncl
national cricket academy|nca
national defence college|ndc
national democratic alliance|nda
national development finance corporation|ndfc
national drainage project|ndp
national economic commission|nec
national economic council|nec
national education foundation|nef
national education policy|nep
national electoral commission|nec
national electric power regulatory authority|nepra
national environmental action plan|neap
national fertiliser corporation|nfc
national fertilisers corporation|nfc
national fertilizer corporation|nfc
national finance commission|nfc
national financial commission|nfc
national fisheries development board|nfdb
national food authority|nfa
national front|nf
national highway authority|nha
national human rights commission|nhrc
national indicative programe|nip
national institute health|nih
national investment trust|nit
national logistic cell|nlc
national olympic committee|noc
national people congress|npc
national police academy|npa
national police bureau|national police burea
national power construction company|npcc
national power construction corporation|npcc
national reconstruction bureau|nrb
national reconstuction bureau|nrb
national security advisory board|nsab
national security agency|nsa
national security council|nsc
national security force|nsf
national technology incubators|nti
national umpires council|national umpire council
national water policy|nwp
national workers party|nwp
natural resource management|nrm
naval armament depot|nad
neonatal tetanus|nt
nepali congress|nc
network leasing corporation limited|nlcl
new california media|ncm
new south wales|nsw
new york police department|nypd
new york stock exchange|nyse
new york times|nyt
new zealand cricket|nzc
nh house|nh
non government organizations|ngo
non governmental organisations|ngo
non governmental organization|ngo
north waziristan agency|nwa
northern alliance|north alliance
northern area legislative council|nalc
nwfp agriculture department|nwfp agri dept
office management group|omg
oic conference|oic co
oil companies advisory committee|ocac
oil company advisory committee|ocac
oilfields limited|oilfield limited
oilseed development board|oil development board
overseas employment corporation|oec
overseas pakistan foundation|opf
overseas pakistanis advisory council|opac
overseas pakistanis foundation|opf
overseas private investment corporation|opic
overseas private investment corp|opic
pac advisory council|pac
pacific news service|pns
pakhtoonkhawa milli awami party|pmap
pakistan agricultural research council|parc
pakistan agriculture research council|parc
pakistan air forces|paf
pakistan air fore|paf
pakistan american business association|paba
pakistan american congress|pac
pakistan american council|pac
pakistan american democratic forum|padf
pakistan armed forces|paf
pakistan armed forces|pakistan armed force
pakistan armed services board|pasb
pakistan atomic energy commission|paec
pakistan bar council|pbc
pakistan boxing federation|pbf
pakistan bridge federation|pbf
pakistan broadcasting corporation|pbc
pakistan community services|pcs
pakistan computer bureau|pcb
pakistan cricket board|pcb
pakistan education foundation|pef
pakistan educational research network|pern
pakistan electrical fans manufacturers association|pefma
pakistan electronic media regulatory authority|pemra
pakistan engineering council|pec
pakistan environment protection foundation|pepf
pakistan girls guide association|pgga
pakistan golf federation|pgf
pakistan green task force|pgtf
pakistan high commission|phc
pakistan housing authority|pha
pakistan human development fund|phdf
pakistan human rights commission|phrc
pakistan industrial development corporation|pidc
pakistan international airlines corporation|piac
pakistan international airlines|pia
pakistan international human rights organization|pihro
pakistan islamic medical association|pima
pakistan karate federation|pkf
pakistan law commission|plc
pakistan medical association|pma
pakistan meteorological department|pmd
pakistan microfinance network|pmn
pakistan muslim league|pml
pakistan national accreditation council|pnac
pakistan national aids program|pnap
pakistan news serviceeditorial|pakistan news service
pakistan news service|pns
pakistan olympic association|poa
pakistan oppressed national alliance movement|ponam
pakistan ordnance factory|pof
pakistan people party|ppp
pakistan petroleum dealers association|ppda
pakistan petroleum limited|ppl
pakistan pharmaceutical manufacturers association|ppma
pakistan red crescent society|prcs
pakistan science foundation|psf
pakistan security printing corporation|pspc
pakistan software export board|pseb
pakistan sports board|psb
pakistan squash federation|psf
pakistan standard institution|psi
pakistan state oil|pso
pakistan steel|ps
pakistan students associations|pakistan students association
pakistan tehreek insaf|pti
pakistan tehrik insaf|pti
pakistan tele communication limited|ptcl
pakistan telecom authority|pta
pakistan telecommunications authority|pta
pakistan telecommunications company limited|ptcl
pakistan telecommunications corporation limited|ptcl
pakistan tobacco board|ptb
pakistan tourism development cooperation|ptdc
pakistan tourism development corporation|ptdc
pakistan trade centre|ptc
pakistan water partnership|pwp
pakistan welfare society|pws
pakistan workers federation|pwf
palestinian broadcasting corp|pbc
palestinian liberation organisation|plo
palestinian national authority|pna
peshawar high court|phc
pharmaceutical industry|pharma industry
pilotless target aircraft|pta
political action committee|pac
ppp punjab|pp
privatisation commission|pc
professional squash association|psa
progressive state oil|pso
progressive women association|pwa
provincial finance commission|pfc
provincial public safety commission|ppsc
public adhoc committee|pac
public affairs committee|pac
public information department|pid
public offering of ssgcl|public offering of ssgc
pukhtoonkhwa milli awami party|pmap
punjab health foundation|phf
punjab olympic association|poa
punjab provincial cooperative bank limited|ppcbl
punjab small industries corporation|psic
punjab squash association|psa
punjab tourism development corporation|ptdc
punjab university|pu
qaumi jamhoori party|qjp
qaumi jamhrooi party|qjp
qaumi tajir ittehad|qti
radio organizations|radio organization
rashtriya swayam sevak|rss
rashtriya swayamsevak sangh|rss
rashtriya swayamsewak sangh|rss
rawalpindi cricket association|rca
rawalpindi medical college|rmc
red crescent authority|rca
red crescent society|rcs
red cross|rc
regional accountability bureau|rab
regional development finance corporation|rdfc
regional organizations|regional organization
rehman medical institute|rmi
religious organizations|religious organization
republican guards|republican guard
revolutionary united front|ruf
rohri limited|rohri limited
royal air force|raf
royal canadian mounted police|rcemp
royal united services institute|rusi
rural health center|rhc
rural social development program|rsdp
saarc human resource development centre|shrdc
saarc information centre|sic
saarc summits|saarc summit
saf media cell|saf media cel
sarhad development authority|sda
saudi electric company|saudi electric com
saudi press agency|spa
saudi royal air forces|sraf
save) committee|save committee
science and technology;|science and technology
sciences and technology|science and technology
sdpi) study group|sdpi study group
security councila|security council
security forces (bsf)|security force (bsf)
shalimar television network|stn
shangahi cooperation orgnaizaton|sco
shanghai cooperation organization|sco
shaukat khanum memorial cancer hospital|skmch
shaukat khanum memorial trust|skmt
shiromani gurdwara parbandhak committee|sgpc
sichuan petroleum administration|spa
sindh adabi board|sab
sindh agriculture university teachers association|sauta
sindh amateur cycling association|saca
sindh assembly|sa
sindh communist party|scp
sindh democratic alliance|sda
sindh engineering limited|sel
sindh environmental protection agency|sindh environment protection agency|sepa
sindh high court|shc
sindh information technology board|sitb
sindh medical college|smc
sindh national front|snf
sindh services hospital|sindh service hospital|ssh
sindh taraqqi passand party|stpp
sindh university|su
singapore electronic system|ses
singapore international airlines|sia
small business finance corporation|sbfc
small industrial development board|sidb
social action program|sap
south asian federation|saf
south asian free media association|safma
south asian olympic council|saoc
south asian sports federation|sasf
south waziristan agency|swa
special air service|sas
special boat squadron|sbs
special communication organisation|sco
special communication organization|sco
special communications organization|special communication organization|sco
special economic zone|sez
special operation group|sog
special operations group|sog
special task forces|stf
special task force|stf
sports board punjab|sbp
sri gurdwara prabandhak board|sgpb
state department|state dept
state engineering corporation|sec
state human rights commission|shrc
strategic forces command|sfc
sui southern gas company limited|ssgcl
sui southern gas company|ssgc
supreme court bar association|scba
taj company|taj co
taliban|taliban
taraqee foundation|tf
technology development fund|tdf
tehsil nazim|tn
telugu desam party|tdp
thai parliamentarian|thai parliament
the citizen foundation|tcf
total medial limited|total media limited
trade development agency|tda
trade union congress|tuc
transportation division|transport division
turkish air forces|turkish air force
turkish naval|turk naval
turkish presidential|turkish president
union council|uc
united airlines|ua
united bank limited|ubl
united christian front|ucf
united cricket board|ucb
united front|uf
united jehad council|ujc
united national party|unp
united nations|un
united nations development program|undp
united nations development project|undp
united nations disaster management team|undmt
united nations industrial development organisation|unido
united nations industrial development organization|unido
united nations information center|unic
united nations international children educational fund|unicef
united nations international children emergency fund|unicef
united nations organization|uno
united nations security council|unsc
united states administration|usa
united states air force|usaf
us agriculture department|us agri dept
us congress|us cong
us governments|us government
us state department|us state dept
utility stores corporation|usc
voice of america|voa
wall street journal|wsj
washington post|wp
water and power development authority|wapda
water and sanitation agency|wasa
water resources research institute|wrri
west indies cricket board|wicb
wetlands project|wetland project
women cricket association|wca
women development foundation|wdf
women health project|whp
women in technology|wit
workers welfare fund|wwf
world bank|wb
world boxing council|wbc
world boxing organisation|wbo
world craft council|wcc
world cricket action committee|wcac
world cup|wc
world economic forum|wef
world economics forum|world economic forum|wef
world food program|wfp
world gold council|wgc
world health assembly|wha
world squash federation|wsf
world tourism organisation|wto
world tourism organization|wto
world trade organisation|wto
world trade organization|wto
world trade organizaton|wto
world wildlife fund|wwf
xiv asiad|xi asiad
yum restaurants international|yri
zarai taraqiati bank limited|ztbl
zarai taraqqiati bank limited|ztbl
ziauddin medical university|zmu
zone vi|zone v


LOCATIONS 


abaro,sindh 
abasna,india 
abbattobad, pakistan 
abbotabad,bannu 
abbotabad, pakistan 
abbottabad, pakistan 
abottabad, pakistan 
african,afghanistan 
agam,afghanistan 
agra,india 
ahmedabad, india 
akron,ohio 
alabama,usa 
alameda,california 
alexandria,egypt 
alexandria,va 
alexandria, virginia 
algiers,algeria 
almatay,kazakhstan 
almaty, kazakhstan 
ames,iowa 
amman,jordan 
amritsar,india 
amstelveen,netherlands 
amsterdam,holland 
amsterdam,netherlands
anaconda,alamo
andijan,uzbekistan
ankara,turkey 
antwerp,belgium 
arachi,pakistan 
arequipa,peru 
argentineans,india 
arin,bandipora 
arkansas,usa 
arlington, virginia 
asean,laos 
aseer,tabuk 
asgabat,turkmenistan 
asghabat,turkmenistan
ashdod, israel
ashghabad,turkmenistan
assam,iaf
assemblies, pakistan 
astore,pakistan 
athens, greece 
atlanta,georgia 
attock,pakistan 
aurora,colorado 
austin,texas 

avenue, brooklyn 
ayodhya,india 
ayubia,pakistan 
badin,pakistan 
bagram, afghanistan 
bahawalnagar, pakistan 
bahawalpur,pakistan 
bahawalpur,rs
baisa,iraq
bakersfield,ca
baku,azerbaijan 
balakot,pakistan 
bali,indonesia 
balochistan,pakistan 
bam, iran 
bamako,mali 
bamyan, afghanistan 
bandipora,baramulla
bandipora,rajouri
bandypoora,budhwar
bangalore,india
bangkok,thailand 
bangok, thailand 
bannu,dikhan 
baqoubah, iraq 
barcelona,spain 
barmer,bikaner 
basel,switzerland 
basha,skardu 
batagram,pakistan 
beaumont,texas 
beijing,china 
beirut,lebanon 
berkeley,california 
bermingham,uk 
bhakkar,pakistan 
bhalwal,sargodha 
bhawalpur,pakistan 
bhopal,india 
bhuj,india
bhurban,murree
bhurban, pakistan
bhuttan,india 
bicester,oxfordshire 
biermingham,uk 
bingol,turkey 
birmingham,uk 
bise,peshawar 
blain,williamson 
boao,china 
bogota,colombia 
boise,idaho 
bolan,sibi 

bombay, india 
bonavish,balkh 
bonn,germany 
bonn,western 
boston,ma 
boston,massachusetts 
boumerdes, algeria 
brdelas, gilgit 
bridgetown, barbados 
bridgeview, il 
brighton,england 
brisbane, australia 
brisbane,queensland 
bsa,ifpi 

bsp,ncp 

budapest, hungary 
budhal,rajouri 
bulawayo,zimbabwe 
burewala,pakistan 
busan,korea
cairo,egypt
california,usa
cambridge,massachusetts
cambridge,uk
canada,north america 
cancum,mexico 
cancun,mexico 
canterbury,uk 
canton,ohio 
cardiff,england 
carrara, italy 
casablanca,morocco 
cern,geneva 
cerritos,ca 
chagai,pakistan 
chakala, pakistan 
chaklala,pakistan 
chakwal,pakistan 
chaman,pakistan 
chamman,baluchistan 
chandigarh,india 
chandpur,bangladesh 
chantilly,france 
charsadda,bannu 
charsadda,pakistan 
charsadda,sanam 
chashma,mianwali 
chattragul,kangan 
chennai,india 
cherbourg,france 
chicago,usa
chichawatni,pakistan
china,asia
chinar,pakistan 
chishtian,pakistan 
chitagong,bangladesh 
chithisinghpora,kashmir 
chitral,pakistan 
cologne,germany 
columbia,maryland 
columbia,missouri 
columbus, georgia 
columbus,ohio 
comparatively,pakistan 
compton,ca 
comstech,idb 
copenhagen,denmark 
cortina, italy 
crawford,texas 
cudahy,ca 
dade,broward 
dadu,pakistan 
dakar,senegal 
dallas,texas 
damascus, syria 
dammam,riyadh 
dar,reserve 

dashkin, pakistan 
daska, pakistan 
daudkhel,mianwali 
dc,usa 
debkafile,india 
deceitfully,india 
dehli,india
dehrawad,afghanistan
delhi,india 
delhi,sagroor 
deoband,india 
detroit,mi 
detroit,michigan 
dha,rawalpindi 
dhaka,bangladesh 
dhamtore,abbottabad 
dhol, baja 
diplo,pakistan 
diplomatically,india 
doda,rajouri 
doha,qatar
dover,delaware 
dubai,uae 
dunyapur,pakistan 
dushanbe, tajikistan 
edmonton,canada 
encino,california 
eriteria,ghana 
eugene,oregon 
evian,france 
excellencies,pakistan 
fairfax,virginia 
faisalabad,pakistan 
faislalabad,pakistan 
falfurrias,texas 
fallujah,iraq 
faryab,kunduz 
fata,islamabad
fata,pakistan
fata,paksat
fata,peshawar 
fem,nawabshah 
fia,karachi 
ficci,fpcci 
flemington,nj 
florida,us 
florida,usa 
fns,germany 
frankfurt, germany 
fremont,california 
french,russia 
fukuoka,japan 
gabba,brisbane 
galyat,mansehra 
gandawah,baluchistan 
ganderbal,achabal 
gardez,afghanistan 
gawadar,meerani 
gawadar,pakistan 
gazaryar,kupwara 
geneva,switzerland 
genoa,italy 
georgetown,guyana 
ghangche, italy 
ghangche,pakistan 
ghanghce,pakistan 
ghazi,tarbela 
ghent,belgium 
ghotki,jacobabad 
ghotki,sukkur 
ghottki,jacobabad
gifu,japan
gilgit,pakistan 
glasgow,scotland 
gniezno,poland 
godhra, india 
gorakhpur,india 
gorny,russia 
gpo,islamabad 
green,pakistan 
guadalajara,mexico 
guangdong,china 
guangnan,china 
guantanamo,cuba 
gujranwala,faisalabad 
gujranwala,gujrat 
gujranwala,pakistan 
gujrat,pakistan 
gultari,northern 
guwahati,india 
gwadar,pakistan 
gwaliar,india 
gwangju,incheon 
gwangju,korea 
hafizabad,pakistan 
hainan,china 
hajj,jeddah 
halabja,iraq 
hamburg,germany 
hangu,sharki 
hansera,india 
haripur,pakistan 
harlan,iowa 




haroonabad,pakistan 
harpsund,sweden 
hartford,connecticut 
hassanabdal, pakistan 
hattian,attock 
hayatabad,peshawar 
hellar,kokernage 
helmand, afghanistan 
helmand,uruzgan 
herat,bamiyan 
hilton,colombo 
hiroshima,japan 
hollywood,florida 
hongkong,thailand 
hospital,chitral 
houston,texas 
hyderabad, district 
hyderabad, india 
hyderabad, pakistan 
ibrat, hyderabad 
igp,sindh 
illinois,usa 

ilo,undp 

indus, gilgit 
interprise,sukkur 
inzaman,pakistan 
ioskom,turkey 
ipoh,malaysia 
ipps,wapda 
isalamabad, pakistan 
isamabad,pakistan
isi,pakistan
iskandariyah,iraq
islamabad, pakistan 
islamabad,pakitan 
islamabad,patten 
islambad,pakistan 
islmabad,pakistan 
istanbul,turkey 
iucn,pakistan 
izmir,turkey 
jaccobabad, pakistan 
jacobabad, pakistan 
jaffarabad, pakistan 
jaglot,pakistan 
jakarat,indonesia 
jakarta,indonesia 
jalalabad, afghanistan 
jalozai,pakistan 
jauharabad, pakistan 
jauhrabad, pakistan 
jawans,naval 
jeddah,saudi 
jehlum, pakistan 
jenin,ramallah 
jhang,multan 
jhang,pakistan 
jhudo, pakistan 
jiuquan,gansu 
jkpp,ppp
joharabad, pakistan 
jomtien, thailand 
jordan, israel 
jowzjan,sar
juharabad, pakistan
jullundur,india 
kabul, afghanistan 
kaduna,nigeria 
kafaragua,switzerland 
kakul,pakistan 
kanadahar, afghanistan 
kananaskis, alberta 
kandahar, afghanistan 
karachi,pakistan 
karahci,quetta 
karak,kohat 
kasur,pakistan 
kathmandu,nepal 
katich, bracken 
kauai,hawaii 
kazapunga,wana 
khairpur,pakistan 
khairpur,sindh 
khairpur,sukkur 
khanewal,pakistan 
khanghar,pakistan 
khanpur,nasirabad 
khanpur,pakistan 
kharan,kalat 
kharan,pakistan 
kharian,pakistan 
khartoum,sudan 
kheri,aarpinchla 
kiev,ukraine 
kingsmead,durban
kingston,jamaica
kita,japan
kiwis,pakistan 
koenigswinter, germany 
kohala,pakistan 
kohat,pakistan 
kohlu,sui 
kolgam,islamabad 
kong,dubai 
korollelvu, fiji 
kotli,pakistan 
krakow,poland 
kulgam,islamabad 
kumble,india 
kundi,kupwara 
kunduz,bamiyan 
kyoto,japan 
laeken,belgium 
lagos,nigeria 
lahore,pakistan 
larkana,dadu 
larkana,jacobabad 
larkana,pakistan 
larkana,shikarpur 
larkana,sibi 
larkana,sindh 
larkana,sukkur 
larnaca,cyprus 
lausanne,switzerland 
layyah,pakistan 
lcci,malaysian
leeds,england 
leicester,uk
lillee,england
liloan,philippines 
lima,peru 
linkoping,sweden 
lisbon, portugal 
logar,khost 
logar,uruzgan 
lomita,ca 
london,england 
london,uk 
lords,england 
lucknow, india 
ludhiana,india
luna,chitral
madison,wisconsin
madras,india
madras,rss 
madrid,spain 
makung,taiwan 
male,maldives 
malikwal,pakistan 
manama,bahrain
manama,us
manchester,england
manchester,uk
mandra,pakistan 
mangla,pakistan 
mango,citrus 
manila,philippines 
manshera,pakistan 
mardan,kohat
mardan,pakistan
mashad,iran
mashud,iran 
masjid,india 
masnaa,lebanon 
mazar, jalalabad 
melawa,afghanistan 
melbourne, australia 
memphis, tennessee 
merida,mexico 
miami,florida 
mianwali,pakistan 
middlesbrough,england 
midland,texas 
mills,pakistan 
minfal,cotton 
mingora,pakistan 
mirpur,muzaffarabad 
mirpur,pakistan 
mirpurkhas, pakistan 
mithi,pakistan 
mobile,alabama 
modena,italy 
monterrey,mexico 
montgomery,alabama 
montreal,canada 
moscow, russia 
mosul, iraq 

mou, pakistan 
multan,pakistan 
mumbai,india 
munich, germany 


murree,pakistan




muslimabad,kohat 
muzafargarh,pakistan 
muzaffarabad,pakistan 
muzaffargarh,pakistan 
naantali,finland 
nacogdoches,texas 
nagpur,india 
nahrin,afghanistan 
nairobi,kenya 
najaf,iraq 
najaf,karbala 
nakial,pakistan 
nankana,pakistan 
narawara,eidgah 
narowal,pakistan 
nashville,tn 
nasiriyah,iraq 
nastogaz,ukraine 
nawabshah,pakistan 
nazimabad,karachi 
nevada,usa 
newark,california 
newlands,capetown 
newton,massachusetts 
newyork,tokyo 
newyork,usa 

ngos, afghanistan 
nha,gwadar 
nimmud,leh 
nlc,rawalpindi 
noonday,texas 


norfolk, virginia
northridge,ca
northrop,ca 
nottingham,england 
nottingham,uk 
nowshera, pakistan 
nyon,switzerland 
oakland,california 
ocona,peru 
odense,denmark 
oita,japan 
orlando, florida 
osaka,japan 
oslo,norway 
otocari,bosnia 
ottawa,canada 
paf,pakistan 
paigham,sahiwal 
pakistan,asia 
pakpattan,pakistan 
paktia,khost 
palestine,texas 
panaji,india 
paramaribo,suriname 
paris,france 
paro,bhutan 
pashtoons,afghanistan 
pashtuns, afghanistan 
pennsylvania,usa 
perth, australia 
peshawar,pakistan 
peshwar,pakistan 
peterborough,england
phagvel,india
philadelphia,pennsylvania 
philadelphia,usa 
phoenix,arizona 
piedar,islamabad 
pims,polyclinic 
piplan,mianwali 
pishin,quetta 
pitham,karachi 
pittsburgh,pa 
pittsburgh,pennsylvania 
pittsfield, massachusetts 
pomona,ca 
poonch,rajouri 
portland,oregon 
pota,india 
prgf,pakistan 
proteas, pakistan 
pseb,sindh 
punjab,sindh 

punjab, india 

punjab, province 
pwd,quetta 
qadirpur,kadanwari 
qala,afghanistan 
qalat,afghanistan 
qandahar, afghanistan 
qatari,azerbaijan 
quetta,pakistan 
rabat,morocco 
rafiganj,india 
ragistan,india 
raipur,india
raiwind, pakistan 
rajasthan,india 
rajasthan, tihar 
ramstein,germany 
rawalpindi,pakistan 
rawlakot,pakistan 
reading,pennsylvania 
reston, virginia 
richardson,texas 
riga, latvia 
risalpur,pakistan 
riverside,ca 
rochester,ny 

rome, italy 

roses, pakistan 
rosoboronexport,india 
rotherham,england 
rotterdam,holland 
rss,india 
ruwani,afghanistan 
sacramento,california 
saddar,karachi 
sadiqabad,pakistan 
safta,pakistan 
sahahar,basra 
sahiwal,pakistan 
sailkot,sindh 
saindak,pakistan 
sakardu,hunza 
sanghar,pakistan 
sanglahill,pakistan 
saraj,afghanistan
sargodha,pakistan 
sargodha,rs 
sarhad,abbottabad 
sarobi,afghanistan 
sarriab,quetta 
satkhira,jessore 
scottsdale,arizona 
sehwan,pakistan 
seibersdorf,austria 
shakargarh,pakistan 
shanghai,china 
shankiari,pakistan 
shanksville,us 
sharjah,uae 
shatoi,chechnya 
sheffield,england 
sheikhupura,pakistan 
shimla,tashkent 
shizuoka,japan 
shkargarh,pakistan 
sialkot,pakistan 
sialkot,rs 
sibi,pakistan 
sigonella, italy 
siliguri,india 
skardo,pakistan 
skardu,pakistan 
skims,soura 
slamabad, pakistan 
smeda,pakistan 


snamprogetti,italy 
soham,england
springfield, virginia 
srilanka,algeria 
srinagar, india 
sringar,kashmir 
stephenville,texas 
stn,shalimar 
stockholm,sweden 
strangely,india 
strasbourg,france 
stuttgart,germany 
suitland,maryland 
sujaat,sindh 
sukkur,pakistan 
sunnyvale,california 
sust,pakistan 
suwon,korea 
sydney, australia 
tacoma,washington 
talagang,pakistan 
taleban, afghanistan 
tampa, florida 
tangier,morocco 
tarnab,peshawar 
tarragona,spain 
tashkent,uzbekistan 
taxila,pakistan 
tehran,iran 
tehsildar,abbottabad 
texas,usa 
thimpu, bhutan 
thirdly,india 
ti,india
tikrit,iraq 
tokyo,japan 
toronto,canada 
township,domail 
trieste,italy 
tripoli,libya 
tufnell,croft 
turin,italy 
tustin,ca 
uav,india 
ubl,pakistan 
udine, italy 
ueberlingen,germany 
ultan,pakistan 
unfpa,italy 
unido,iib 
united states,north america 
urumdi,china 
vacaville,california 
valencia,spain 
valladolid,spain 
vancouver,canada 
venice, italy 
verrettes, haiti 
vettori, warne 
vienna,austria 
vienna,va 
vientiane,laos 
vilseck, germany 
waddington,england 
wagah, pakistan 
wah,pakistan
wana,pakistan 
warcha,kalabagh 
washington dc,united states 
wasington,united states 
waziristan, pakistan 
westmont,illinois 
weybridge,uk 
whistle,pakistan 
windies,india 
xian,china 
xinjiang,china 
yeman,zimbabwe 
yogyakarta,indonesia 
yokohama, japan 
zairat,kalat 
zanakote,hmt 
zangalopora,islamabad 
zhejiang,china 
zhob,pakistan 
ziarat,pakistan 


zte,pakistan 
APPENDIX B


SUPPORT CLASSES JAVADOC 


1. Class FileHandler 


java.lang.Object 
  └ thesis.FileHandler
All Implemented Interfaces: 


qdxml.DocHandler 


public class FileHandler 
extends java.lang.Object 
implements qdxml.DocHandler 
Author: 


Matthew W. Esparza 


Constructor Summary 


FileHandler () 


Method Summary 


void close()
void endDocument()
void endElement(java.lang.String elem)
java.lang.String getDirectoryName()
java.lang.String getFileName()
int getLineCount()
java.lang.String getPath()
void init(java.lang.String fileName)
java.lang.String[] listDirectoryContents()
void openDirectory()
java.lang.String openFileChannel(java.lang.String specFile)
void setLineCount(int num)
void startDocument()
void startElement(java.lang.String elem, java.util.Hashtable h)
void text(java.lang.String text)








Methods inherited from class java.lang.Object 


equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait











Constructor Detail 








FileHandler 


public FileHandler() 


Method Detail 





init 


public void init(java.lang.String fileName)


close 
public void close() 
throws java.io.IOException


Throws: 





java.io.IOException


openFileChannel 
public java.lang.String openFileChannel(java.lang.String specFile)


openDirectory 


public void openDirectory() 


listDirectoryContents 


public java.lang.String[] listDirectoryContents() 


getFileName 


public java.lang.String getFileName() 


getPath 


public java.lang.String getPath() 


getDirectoryName 


public java.lang.String getDirectoryName() 


setLineCount 


public void setLineCount(int num) 


getLineCount 


public int getLineCount() 


startDocument 
public void startDocument()
                   throws java.lang.Exception
Specified by:
startDocument in interface qdxml.DocHandler


Throws: 





java.lang.Exception 


endDocument 
public void endDocument() 
Specified by: 


endDocument in interface qdxml.DocHandler 


startElement 
public void startElement(java.lang.String elem,
                         java.util.Hashtable h)


Specified by: 





startElement in interface qdxml.DocHandler 


endElement 
public void endElement(java.lang.String elem) 


Specified by: 





endElement in interface qdxml.DocHandler


text 
public void text(java.lang.String text)
Specified by: 


text in interface qdxml.DocHandler 


2. Class Person 


java.lang.Object 


  └ thesis.Person


public class Person 
extends java.lang.Object 
Author: 


Matthew W. Esparza 





Constructor Summary 





Person () 











Method Summary 

void addActions(java.lang.String str)
void addAssoc(java.lang.String str)
void addCity(java.lang.String str)
void addContent(java.lang.String str)
void addCountry(java.lang.String str)
void addEnemies(java.lang.String str)
void addFamily(java.lang.String str)
void addFriends(java.lang.String str)
void addLocation(java.lang.String str)
void addName(java.lang.String str)
void addOrganization(java.lang.String str)
void addQuotes(java.lang.String str)
void addReference(java.lang.String str)
void addTimelineData(java.util.Calendar date)
void addTitle(java.lang.String str)
void addTombstoneData(java.lang.String str)
java.util.LinkedList getActions()
java.util.LinkedList getAssocs()
java.util.LinkedList getCity()
java.util.LinkedList getContent()
java.util.LinkedList getCountry()
java.util.LinkedList getEnemies()
java.util.LinkedList getFamily()
java.util.LinkedList getFriends()
java.util.LinkedList getLocation()
java.util.regex.Pattern getNamePattern()
java.util.LinkedList getNames()
java.util.LinkedList getOrganizations()
java.util.LinkedList getQuotes()
java.util.LinkedList getReferences()
java.util.LinkedList getTimelineData()
java.util.LinkedList getTitles()
java.util.LinkedList getTombstoneData()
boolean isVoid()
java.lang.String[] removeRedundancy(java.lang.String[] array)
void setNamePattern(java.lang.String str)


Methods inherited from class java.lang.Object 


equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait





Constructor Detail 





Person 
public Person()


Method Detail


isVoid


public boolean isVoid() 


addName 


public void addName(java.lang.String str) 


getNames 


public java.util.LinkedList getNames() 


addTitle 


public void addTitle(java.lang.String str) 


getTitles 


public java.util.LinkedList getTitles() 


addTombstoneData 


public void addTombstoneData(java.lang.String str) 


getTombstoneData 


public java.util.LinkedList getTombstoneData() 
addTimelineData


public void addTimelineData(java.util.Calendar date) 


getTimelineData 


public java.util.LinkedList getTimelineData() 


addFamily 


public void addFamily(java.lang.String str) 


getFamily 


public java.util.LinkedList getFamily() 


addFriends 


public void addFriends(java.lang.String str) 


getFriends 


public java.util.LinkedList getFriends() 


addAssoc 


public void addAssoc(java.lang.String str) 


getAssocs 


public java.util.LinkedList getAssocs() 
addEnemies


public void addEnemies(java.lang.String str) 


getEnemies 


public java.util.LinkedList getEnemies() 


addQuotes 


public void addQuotes(java.lang.String str) 


getQuotes 


public java.util.LinkedList getQuotes() 


addActions 


public void addActions(java.lang.String str) 


getActions 


public java.util.LinkedList getActions() 


addCountry 


public void addCountry(java.lang.String str) 


getCountry 


public java.util.LinkedList getCountry() 
addCity


public void addCity(java.lang.String str)


getCity 


public java.util.LinkedList getCity() 


addLocation 


public void addLocation(java.lang.String str) 


getLocation 


public java.util.LinkedList getLocation() 


addOrganization 


public void addOrganization(java.lang.String str)


getOrganizations 


public java.util.LinkedList getOrganizations() 


addReference 


public void addReference(java.lang.String str) 


getReferences 


public java.util.LinkedList getReferences() 
addContent


public void addContent(java.lang.String str) 


getContent 


public java.util.LinkedList getContent() 


setNamePattern 


public void setNamePattern(java.lang.String str) 


getNamePattern 


public java.util.regex.Pattern getNamePattern() 


removeRedundancy 


public java.lang.String[] removeRedundancy(java.lang.String[] array) 
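
The Javadoc above records only the signature of removeRedundancy, not its behaviour. The following is a minimal sketch of one plausible reading, assuming that "redundancy" means exact duplicate strings and that the first occurrence of each string is kept in its original position. The class name RedundancySketch is hypothetical and is not part of the thesis code.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in; not the thesis implementation of Person.removeRedundancy.
class RedundancySketch {

    // Assumption: a string is redundant only if it is an exact duplicate of an
    // earlier one. Null slots are skipped; first-occurrence order is preserved.
    static String[] removeRedundancy(String[] array) {
        Set<String> seen = new LinkedHashSet<>();
        for (String s : array) {
            if (s != null) {
                seen.add(s);
            }
        }
        return seen.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] sentences = {"He met the minister.", "He left.", "He met the minister."};
        for (String s : removeRedundancy(sentences)) {
            System.out.println(s);
        }
    }
}
```

A near-duplicate test (for example, comparing normalized or overlapping sentences) could be substituted for exact equality without changing the shape of the method.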
3. Class TupleArray


java.lang.Object 


  └ thesis.TupleArray


public class TupleArray 
extends java.lang.Object 
Author: 


Matthew W. Esparza 
Constructor Summary


TupleArray(int num)


Method Summary


void addString(java.lang.String s)
java.lang.String argMax()
java.lang.String argMin()
double calcPer(double num)
int getCount(int index)
int getIndex(java.lang.String s)
double getPercent(int index)
double getSize()
java.lang.String getString(int index)
void incrementCount(java.lang.String s)
double logProb(double percentAsDec)
void removeString(java.lang.String s)
void setCount(int num, int index)
void setPercent(double num, int index)
void setString(java.lang.String s, int index)
boolean valuePresent(java.lang.String s)


Methods inherited from class java.lang.Object 


equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait





Constructor Detail 





TupleArray 


public TupleArray(int num) 


Method Detail 
getSize


public double getSize() 


addString 


public void addString (java.lang.String s) 


removeString 


public void removeString(java.lang.String s) 


valuePresent 


public boolean valuePresent(java.lang.String s) 


incrementCount 


public void incrementCount(java.lang.String s) 


setString 
public void setString(java.lang.String s,
                      int index)


getString 


public java.lang.String getString(int index) 


getIndex 


public int getIndex(java.lang.String s) 
setCount
public void setCount(int num,
                     int index)


getCount 


public int getCount(int index) 


argMax 


public java.lang.String argMax() 


argMin 


public java.lang.String argMin() 


setPercent 
public void setPercent(double num,
                       int index)


getPercent 


public double getPercent(int index) 


calcPer 


public double calcPer(double num) 
logProb


public double logProb(double percentAsDec) 
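
The TupleArray signatures above suggest a table of strings with per-string counts, percentages, and an arg-max lookup. The sketch below illustrates that idea under stated assumptions: it keys on the string rather than on an array index, treats a percentage as count divided by total, and uses the natural logarithm for logProb. The class CountTableSketch and these choices are hypothetical; the Javadoc does not confirm how TupleArray itself computes them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of a string-count table; not thesis.TupleArray.
class CountTableSketch {
    private final Map<String, Integer> counts = new LinkedHashMap<>();
    private int total = 0;

    // Adds the string if it is new, otherwise increments its count.
    void incrementCount(String s) {
        counts.merge(s, 1, Integer::sum);
        total++;
    }

    int getCount(String s) {
        return counts.getOrDefault(s, 0);
    }

    // Assumption: "percent" is the fraction of all increments that went to s.
    double getPercent(String s) {
        return total == 0 ? 0.0 : (double) getCount(s) / total;
    }

    // Returns the most frequent string; ties go to the string seen first.
    String argMax() {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    // Assumption: natural logarithm of a probability given as a decimal.
    static double logProb(double percentAsDec) {
        return Math.log(percentAsDec);
    }

    public static void main(String[] args) {
        CountTableSketch titles = new CountTableSketch();
        titles.incrementCount("president");
        titles.incrementCount("general");
        titles.incrementCount("president");
        System.out.println(titles.argMax() + " " + titles.getPercent(titles.argMax()));
    }
}
```

A table like this is one way a "decide" step could pick the most frequently collected candidate for a biography slot.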


B. INFORMATION EXTRACTION JAVADOC 


1. Class Ordering


java.lang.Object 


  └ Ordering


public class Ordering 
extends java.lang.Object 
Author: 


Matthew W. Esparza 


Constructor Summary 





Ordering() 













Method Summary 


static void init(java.lang.String[] args) 
Methods inherited from class java.lang.Object


equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait 


Constructor Detail 


Ordering 

public Ordering() 

Method Detail 

init 

public static void init(java.lang.String[] args) 
Parameters: 


args — 
2. Class ReferenceCounter 


java.lang.Object 
  └ ReferenceCounter
All Implemented Interfaces: 


qdxml.DocHandler 


public class ReferenceCounter 
extends java.lang.Object 
implements qdxml.DocHandler 
Author: 

Matthew W. Esparza 
Constructor Summary


ReferenceCounter () 


Method Summary 


void endDocument()
void endElement(java.lang.String elem)
static int getCounter()
static java.lang.String getVal()
static void init(java.lang.String[] args)
static void processNumbers(java.lang.String str)
static java.lang.String[] retDocList()
static void setCounter(int num)
static void setVal(java.lang.String str)
void startDocument()
void startElement(java.lang.String elem, java.util.Hashtable h)
void text(java.lang.String text)


Methods inherited from class java.lang.Object 


equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait





Constructor Detail 





ReferenceCounter 


public ReferenceCounter() 


Method Detail 





startDocument 


public void startDocument() 

throws java.lang.Exception 
Specified by: 
startDocument in interface qdxml.DocHandler
Parameters: 


null - 




Throws: 





java.lang.Exception 


endDocument 

public void endDocument() 

Specified by: 
endDocument in interface qdxml.DocHandler 
Parameters: 


null - 


startElement 
public void startElement(java.lang.String elem,
                         java.util.Hashtable h)


Specified by: 





startElement in interface qdxml.DocHandler 
Parameters: 


String - elem, Hashtable h 


endElement 
public void endElement(java.lang.String elem) 


Specified by: 





endElement in interface qdxml.DocHandler
Parameters: 


String - elem 
text

public void text(java.lang.String text) 
Specified by: 

text in interface qdxml.DocHandler 
Parameters: 


String - text 


setVal 
public static void setVal(java.lang.String str) 
Parameters: 


String - Str 


getVal 

public static java.lang.String getVal() 
Parameters: 

null - 

Returns: 


void 


processNumbers 
public static void processNumbers(java.lang.String str) 
Parameters: 


String - Str 
retDocList

public static java.lang.String[] retDocList() 
Parameters: 

null - 

Returns: 


String[] 


setCounter 
public static void setCounter(int num) 
Parameters: 


int - num


getCounter 

public static int getCounter() 
Parameters: 

null - 

Returns: 


int 


init 
public static void init(java.lang.String[] args)
Parameters: 


String - args[] 
3. Class Searcher


java.lang.Object 


  └ Searcher


public class Searcher 
extends java.lang.Object 
Author: 


Matthew W. Esparza 


Constructor Summary 





Searcher () 

Method Summary 

static void init (java.lang.String[] args) 

static void marchThrough (java.lang.String[] files) 








Methods inherited from class java.lang.Object 


equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait


Constructor Detail


Searcher 


public Searcher() 
Method Detail 


marchThrough 
public static void marchThrough(java.lang.String[] files) 
Parameters: 


files - 


init 

public static void init(java.lang.String[] args)
throws java.lang.Exception 

Parameters: 

args - 


Throws: 





java.lang.Exception 
KEYWORD EXTRACTION JAVADOC


1. Class KeywordExtractor


java.lang.Object 


  └ SenFil4.KeywordExtractor


public class KeywordExtractor 


extends java.lang.Object 


Constructor Summary 


KeywordExtractor () 








Method Summary 


static void collectFamily(java.lang.String sentence, java.lang.String firstName, java.lang.String lastName, thesis.Person p)
static void collectFriends(java.lang.String sentence, thesis.Person p)
static void collectName(java.lang.String sentence, java.lang.String firstName, java.lang.String lastName, thesis.Person p)
static void collectOrganizations(java.lang.String sentence, thesis.Organization currOrg, thesis.Person p)
static boolean collectQuote(java.lang.String sentence, thesis.Person p)
static void collectTitle(java.lang.String sentence, java.lang.String firstName, java.lang.String lastName, java.lang.String currTitle, thesis.Person p)
static void decideLocation(thesis.Person p)
static void decideName(thesis.Person p)
static void decideOrganization(thesis.Person p)
static void decideTitle(thesis.Person p)
static void initLocationPatterns(java.lang.String[] locs)
static void initOrganPatterns(java.lang.String[] orgs)
static java.lang.String[] IsolateEntities(java.lang.String fileContent)
static void main(java.lang.String[] args)


Methods inherited from class java.lang.Object 


equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait


Constructor Detail 


KeywordExtractor 


public KeywordExtractor() 
Method Detail 


initOrganPatterns 


public static void initOrganPatterns(java.lang.String[] orgs) 


initLocationPatterns 


public static void initLocationPatterns(java.lang.String[] locs) 


IsolateEntities 

public static java.lang.String[] IsolateEntities(java.lang.String fileContent) 
Parameters: 

Takes - no parameters. 

Returns: 


Returns an array of entity names and the sentences for each entity. 
collectName

public static void collectName(java.lang.String sentence, 
java.lang.String firstName, 
java.lang.String lastName, 
thesis.Person p) 

Parameters: 


String[] - array -- Sentences from file. Person p -- Person object for current 


person. 


decideName 


public static void decideName(thesis.Person p) 


collectTitle 

public static void collectTitle(java.lang.String sentence, 
java.lang.String firstName, 
java.lang.String lastName, 
java.lang.String currTitle, 


thesis.Person p) 


decideTitle 


public static void decideTitle(thesis.Person p) 


collectQuote 
public static boolean collectQuote(java.lang.String sentence,
                                   thesis.Person p)


collectOrganizations 
public static void collectOrganizations(java.lang.String sentence, 
thesis.Organization currOrg, 


thesis.Person p) 


decideOrganization 


public static void decideOrganization(thesis.Person p) 


collectFamily 

public static void collectFamily(java.lang.String sentence, 
java.lang.String firstName, 
java.lang.String lastName, 


thesis.Person p) 


collectFriends 
public static void collectFriends(java.lang.String sentence, 


thesis.Person p) 


decideLocation 


public static void decideLocation(thesis.Person p) 


main 
public static void main(java.lang.String[] args)
Parameters: 


args - 


APPENDIX C 


Source code: 








Ordering.java 





import thesis.FileHandler;
import thesis.Person;
import java.util.Iterator;

/**
 * @author Matthew W. Esparza
 * @date 2007
 * @class Ordering.java
 * @description This class is responsible for ordering the entity references.
 */
public class Ordering {

    private static FileHandler fh = new FileHandler();

    /**
     * @name init
     * @param args
     * @return void
     * @input The input opened by the openFileChannel function is a
     *        file containing the entity references.
     * @output The output is a file containing the entity references in
     *         order.
     */
    public static void init(String[] args) {
        String name = null;

        fh.init("MORETHAN100_080207_SORTED.txt");

        String content = fh.openFileChannel("MORETHAN100_080207.txt");
        String[] entityBlocks = content.split("p_");
        Person[] perArray = new Person[entityBlocks.length];

        // Step 1: Get all of the entities in Person objects
        for (int p = 0; p < entityBlocks.length; p++) {
            perArray[p] = new Person();
            String[] lines = entityBlocks[p].split("\n");
            for (int t = 0; t < lines.length; t++) {
                if (lines[t].contains("is referenced")) {
                    name = lines[t].substring(0, lines[t].indexOf("_p"));
                    perArray[p].addName(name);
                }
                else if (lines[t].contains("START REFERENCES")) {
                    // Read each following line until END REFERENCES is reached
                    while (lines[t].contains("END REFERENCES") == false) {
                        String[] numArray = lines[t + 1].split(" ");
                        for (int n = 0; n < numArray.length; n++) {
                            if (numArray[n].matches("[0-9]+")) {
                                perArray[p].addReference(numArray[n]);
                            }
                        }
                        t++;
                    }
                }
            }
        }

        String[] per1Refs, per2Refs;

        // SORTING: order entities by their first reference number
        for (int s = 0; s < perArray.length; s++) {
            per1Refs = new String[perArray[s].getReferences().size()];
            Iterator p1 = perArray[s].getReferences().iterator();
            int idx = 0;
            while (p1.hasNext()) {
                per1Refs[idx] = (String) p1.next();
                idx++;
            }
            int loValue = -1;
            if (per1Refs.length > 0 && per1Refs[0] != null) {
                loValue = Integer.parseInt(per1Refs[0]);
            }
            for (int r = s + 1; r < perArray.length; r++) {
                per2Refs = new String[perArray[r].getReferences().size()];
                Iterator p2 = perArray[r].getReferences().iterator();
                int idx2 = 0;
                while (p2.hasNext()) {
                    per2Refs[idx2] = (String) p2.next();
                    idx2++;
                }
                if (per2Refs.length > 0 && per2Refs[0] != null) {
                    int val2 = Integer.parseInt(per2Refs[0]);
                    if (val2 < loValue) {
                        // Swap
                        Person temp = perArray[s];
                        perArray[s] = perArray[r];
                        perArray[r] = temp;
                        loValue = val2;
                    }
                }
            }
        }

        // PRINTING
        Iterator l = null, m = null;
        for (int i = 0; i < perArray.length; i++) {
            if (perArray[i] != null && perArray[i].getNames().isEmpty() == false) {
                l = perArray[i].getNames().iterator();
                while (l.hasNext()) {
                    String str = (String) l.next();
                    System.out.println("p_" + str + " is referenced " +
                            perArray[i].getReferences().size() + " times.");
                    System.err.println(str);

                    m = perArray[i].getReferences().iterator();
                    int count = 0;
                    System.out.println("START REFERENCES");
                    while (m.hasNext()) {
                        String ref = (String) m.next();
                        // Start a new row after every ten references; every
                        // reference is printed and the counter is advanced.
                        if (count % 10 == 0 && count != 0)
                            System.out.println();
                        System.out.print(ref + " ");
                        count++;
                    }
                    System.out.println();
                    System.out.println("END REFERENCES");
                    System.out.println();
                }
            }
        }

        try {
            fh.close();
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}
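
The sorting step in Ordering.java orders entities by the first reference number recorded for each. As a hedged illustration of the same idea, the sketch below performs that ordering with a comparator instead of hand-written swaps. EntitySketch is a hypothetical stand-in for thesis.Person, and placing entities that have no references at the end is an assumption of this sketch rather than documented behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration; EntitySketch stands in for thesis.Person.
class OrderingSketch {

    static class EntitySketch {
        final String name;
        final List<Integer> references;

        EntitySketch(String name, Integer... references) {
            this.name = name;
            this.references = Arrays.asList(references);
        }
    }

    // Assumption: an entity with no references sorts after all others.
    static int firstReference(EntitySketch e) {
        return e.references.isEmpty() ? Integer.MAX_VALUE : e.references.get(0);
    }

    // Returns entity names ordered by first reference number (stable sort).
    static List<String> namesInOrder(List<EntitySketch> entities) {
        List<EntitySketch> sorted = new ArrayList<>(entities);
        sorted.sort(Comparator.comparingInt(OrderingSketch::firstReference));
        List<String> names = new ArrayList<>();
        for (EntitySketch e : sorted) {
            names.add(e.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<EntitySketch> entities = Arrays.asList(
                new EntitySketch("second entity", 40, 41),
                new EntitySketch("first entity", 3, 90));
        System.out.println(namesInOrder(entities));
    }
}
```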
ReferenceCounter.java





import java.io.*;
import java.util.Enumeration;
import java.util.Hashtable;
import thesis.FileHandler;
import qdxml.*;

/**
 * @author Matthew W. Esparza
 * @date 2007
 * @name ReferenceCounter
 * @description Class implements the DocHandler interface from the Quick
 *              and Dirty XML Parser. Basically, given a minimum and
 *              maximum number as thresholds, the class will search through
 *              UniqueEntities.xml and retrieve the entities that have the
 *              desired number of references.
 */
public class ReferenceCounter implements DocHandler {

    private static final int MIN_LIMIT = 99;
    private static final int MAX_LIMIT = 10000000;
    private static int entityCount = 0, count;
    private static String key, val, docList;
    private static String[] initial;
    private static ReferenceCounter rfc = new ReferenceCounter();
    private static FileHandler fh = new FileHandler();

    /* -- Starts implementation of DocHandler -- */

    /**
     * @name startDocument
     * @param null
     * @return void
     */
    public void startDocument() throws Exception {
    }

    /**
     * @name endDocument
     * @param null
     * @return void
     */
    public void endDocument() {
    }

    /**
     * @name startElement
     * @param String elem, Hashtable h
     * @return void
     * @description Parses the XML and stores the elements in a hashtable.
     */
    public void startElement(String elem, Hashtable h) {
        Enumeration e = h.keys();
        while (e.hasMoreElements()) {
            key = (String) e.nextElement();
            val = (String) h.get(key);
            key = key.trim();
            setVal(val);
        }
    }

    /**
     * @name endElement
     * @param String elem
     * @return void
     */
    public void endElement(String elem) {
        /* if (elem.equalsIgnoreCase("PERSON")) {
            System.out.println(" end elem: " + elem);
        } */
    }

    /**
     * @name text
     * @param String text
     * @return void
     */
    public void text(String text) {
        docList = text;
        processNumbers(docList);
    }

    /* -- Ends implementation of DocHandler -- */

    /**
     * @name setVal
     * @param String str
     * @return void
     */
    public static void setVal(String str) {
        val = str;
    }

    /**
     * @name getVal
     * @param null
     * @return String
     */
    public static String getVal() {
        return val;
    }

    /**
     * @name processNumbers
     * @param String str
     * @return void
     * @description Loops through UniqueEntities.xml and prints out
     *              reference indexes of entities
     *              that meet the limit criteria.
     */
    public static void processNumbers(String str) {
        initial = str.split(" ");
        int counter = 0;

        for (int i = 0; i < initial.length; i++) {
            counter++;
        }

        setCounter(counter);
        if (counter > MIN_LIMIT && counter <= MAX_LIMIT) {
            if (initial[0].trim().length() > 0) {
                System.out.println(getVal() + " is referenced " +
                        getCounter() + " times.\n");
                System.out.println("START REFERENCES");
                for (int j = 0; j < initial.length; j++) {
                    // Start a new row after every ten references; every
                    // reference is printed.
                    if (j % 10 == 0 && j != 0)
                        System.out.println();
                    System.out.print(initial[j] + " ");
                }
                System.out.println();
                System.out.println("END REFERENCES");
                System.out.println();
                entityCount++;
            }
        }
    }

    /**
     * @name retDocList
     * @param null
     * @return String[]
     */
    public static String[] retDocList() {
        return initial;
    }

    /**
     * @name setCounter
     * @param int num
     * @return void
     */
    public static void setCounter(int num) {
        count = num;
    }

    /**
     * @name getCounter
     * @param null
     * @return int
     */
    public static int getCounter() {
        return count;
    }

    /**
     * @name init
     * @param String args[]
     * @return void
     * @description Parses XML document through the use of quick and
     *              dirty parser, which in turn calls the reference
     *              related functions above.
     */
    public static void init(String args[]) {
        fh.init("MoreThan100.txt");

        // Open UniqueEntities.xml
        fh.openFileChannel("UniqueEntities.xml");
        try {
            FileReader fr = new FileReader("UniqueEntities.xml");
            QDParser.parse(rfc, fr);
        }
        catch (Exception e) {
            e.printStackTrace();
        }

        System.out.println(entityCount + " entities");

        try {
            fh.close();
        }
        catch (IOException ioe) {
            ioe.printStackTrace();
        }
    }
}
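
ReferenceCounter keeps an entity only when its reference count lies above a minimum and at or below a maximum, and then prints the reference numbers in rows of ten. The sketch below isolates those two steps so they can be checked without an XML file. The class ReferenceFormatSketch and its method names are hypothetical, and joining numbers with single spaces is an assumption about the intended layout.

```java
// Hypothetical illustration of the threshold test and row formatting.
class ReferenceFormatSketch {

    // Mirrors the comparison used in processNumbers: min exclusive, max inclusive.
    static boolean withinLimits(int count, int minLimit, int maxLimit) {
        return count > minLimit && count <= maxLimit;
    }

    // Joins references with spaces and starts a new line after every perLine items.
    static String formatReferences(String[] refs, int perLine) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < refs.length; i++) {
            if (i > 0) {
                sb.append(i % perLine == 0 ? '\n' : ' ');
            }
            sb.append(refs[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] refs = {"12", "57", "58", "103", "250"};
        if (withinLimits(refs.length, 2, 10000000)) {
            System.out.println("START REFERENCES");
            System.out.println(formatReferences(refs, 10));
            System.out.println("END REFERENCES");
        }
    }
}
```

With the thesis constants (a minimum of 99), an entity needs at least 100 references to be kept.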
Searcher.java





import java.util.regex.*;
import java.util.Iterator;
import thesis.FileHandler;
import thesis.Person;

/**
 * @author Matthew W. Esparza
 * @date 2007
 * @name Searcher
 * @description This class is responsible for finding the reference numbers
 *              output by the ReferenceCounter class. It searches through
 *              multiple files in the corpus directory.
 */
public class Searcher {

    private static String[] entityBlocks;
    private static FileHandler fh = new FileHandler();
    private static Person[] perArr;

    /**
     * @name marchThrough
     * @param files
     * @return void
     * @description Scans through XML corpus of tagged news articles and
     *              looks for <PERSON> tags. If the ID in the tag
     *              matches the one we are looking for, then we collect
     *              the sentence it was found in and add it to the
     *              current person's "content block."
     */
    public static void marchThrough(String[] files) {
        Pattern perId = Pattern.compile("PERSON ID=\"[0-9]+\"");
        Matcher idMatch = null;
        String path = null;
        String content = null;
        for (int e = 0; e < files.length; e++) {
            path = fh.getDirectoryName() + "\\" + files[e];
            System.err.println("Now processing: " + path);

            content = fh.openFileChannel(path);
            String[] lines = content.split("\n");
            boolean firstTime = true;

            // Will find instances of PERSON ID="XX" in XML files
            for (int j = 0; j < lines.length; j++) {
                idMatch = perId.matcher(lines[j]);
                while (idMatch.find()) {
                    String mention = idMatch.group();
                    int firstQuote = mention.indexOf("\"");
                    int lastQuote = mention.lastIndexOf("\"");
                    String refNumb = mention.substring(firstQuote + 1, lastQuote);
                    int menNumb = Integer.parseInt(refNumb);
                    boolean found = false;
                    for (int l = 0; l < perArr.length; l++) {
                        if (perArr[l] != null) {
                            Iterator itr = perArr[l].getReferences().iterator();
                            while (itr.hasNext()) {
                                int ref = Integer.parseInt((String) itr.next());
                                if (menNumb == ref) {
                                    if (ref == Integer.parseInt((String) perArr[l].getReferences().get(0))) {
                                        firstTime = true;
                                    }
                                    else {
                                        firstTime = false;
                                    }
                                    found = true;
                                    if (perArr[l].getNames().size() > 0 && firstTime == true) {
                                        System.out.println("Name: " + perArr[l].getNames().get(0) + "\n");
                                        firstTime = false;
                                    }
                                    // Strip the XML tags from the matching line
                                    String noTags = lines[j].replaceAll("<.*?>", "");
                                    perArr[l].addContent(noTags);
                                    break;
                                }
                            }

                            Iterator con = perArr[l].getContent().iterator();
                            String[] conArr = new String[perArr[l].getContent().size()];
                            int conCount = 0;
                            while (con.hasNext()) {
                                conArr[conCount++] = (String) con.next();
                            }
                            if (conArr != null) {
                                String[] newConArr = perArr[l].removeRedundancy(conArr);
                                for (int o = 0; o < newConArr.length; o++) {
                                    System.out.println(newConArr[o]);
                                }
                            }

                            if (found == true) {
                                break;
                            }
                        }
                    }
                }
            }
            lines = null;
        }
    }

    /**
     * @name init
     * @param args
     * @throws Exception
     * @return void
     * @description Loops through the XML corpus and calls marchThrough.
     */
    public static void init(String[] args) throws Exception {
        fh.init("MORETHAN100_080207_SORTED_NOTAGS.txt");

        // Read Input File and split into sentences
        String name = null;
        String content = fh.openFileChannel("MORETHAN100_080207_SORTED.txt");
        entityBlocks = content.split("p_");
        perArr = new Person[entityBlocks.length];

        fh.openDirectory();
        String[] fileListing = fh.listDirectoryContents();

        // Process references
        for (int p = 0; p < entityBlocks.length; p++) {
            perArr[p] = new Person();
            String[] lines = entityBlocks[p].split("\n");
            for (int t = 0; t < lines.length; t++) {
                if (lines[t].contains("is referenced")) {
                    name = lines[t].substring(0, lines[t].indexOf(" is"));
                    if (name != null && name.length() > 0) {
                        perArr[p].addName(name);
                    }
                }
                else if (lines[t].contains("START REFERENCES")) {
                    while (lines[t].contains("END REFERENCES") == false) {
                        String[] numArray = lines[t + 1].split(" ");
                        for (int n = 0; n < numArray.length; n++) {
                            if (numArray[n].matches("[0-9]+")) {
                                perArr[p].addReference(numArray[n]);
                            }
                        }
                        t++;
                    }
                }
            }
        }

        marchThrough(fileListing);

        try {
            fh.close();
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}
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KeywordExtractor.java 

















import java.util.regex.*; 
import java.util.Iterator; 
import thesis.Person; 
import thesis.TupleArray; 
import thesis.FileHandler; 
/**
 * @class KeywordExtractor
 * @author Matthew W. Esparza
 * @description Contains functions to fill slots of biography.
 * Slots include Name, Job Title, Nationality, Tombstone Data,
 * Professional Relationships, Familial Relationships,
 * Enemies, Friends, Quotes [not counted as truth], Actions
 * @version 4.0
 */
public class KeywordExtractor { 
private static FileHandler fh = new FileHandler(); 
private static Person[] people; 
private static Pattern[] orgPats, contPats, cityPats, titlePats; 


    public static Pattern[] initOrganPatterns(String[] orgs) {
        int numOrgs = orgs.length;
        orgPats = new Pattern[numOrgs];
        for (int z = 0; z < numOrgs; z++) {
            // Each entry holds a full name and an abbreviation separated by "|".
            String[] currOrg = orgs[z].split("\\|");
            String orgName = currOrg[0];
            orgName = orgName.replaceAll("\\p{Punct}", "");
            String abbrev = currOrg[1];
            abbrev = abbrev.replaceAll("\\p{Punct}", "");
            orgPats[z] = Pattern.compile(" (" + orgName + ") | (" + abbrev + ") ",
                    Pattern.DOTALL |
                    Pattern.MULTILINE |
                    Pattern.CASE_INSENSITIVE);
        }
        return orgPats;
    }

    public static Pattern[] initCountryPatterns(String[] locations) {
        int numLocas = locations.length;
        contPats = new Pattern[numLocas];
        for (int z = 0; z < numLocas; z++) {
            // Each entry is "city, country"; the country is the second field.
            String[] currLoc = locations[z].split(",");
            String countryName = currLoc[1].trim();
            contPats[z] = Pattern.compile(" (" + countryName + ") ",
                    Pattern.DOTALL |
                    Pattern.MULTILINE |
                    Pattern.CASE_INSENSITIVE);
        }
        return contPats;
    }

    public static Pattern[] initCityPatterns(String[] locations) {
        int numLocas = locations.length;
        cityPats = new Pattern[numLocas];
        for (int z = 0; z < numLocas; z++) {
            // Each entry is "city, country"; the city is the first field.
            String[] currLoc = locations[z].split(",");
            String cityName = currLoc[0].trim();
            cityPats[z] = Pattern.compile(" (" + cityName + ") ",
                    Pattern.DOTALL |
                    Pattern.MULTILINE |
                    Pattern.CASE_INSENSITIVE);
        }
        // Return the city patterns built above (the listing returned contPats).
        return cityPats;
    }
    public static Pattern[] initTitlePatterns(String[] titles) {
        int numTitles = titles.length;
        titlePats = new Pattern[numTitles];
        for (int z = 0; z < numTitles; z++) {
            String currTitle = titles[z];
            // Some titles like "pace-bowler" have an embedded dash
            // that needs to be turned into a literal "\\-" for the
            // purpose of the regex.
            String reFormTitle = currTitle.replaceAll("\\-", "\\\\-");
            titlePats[z] = Pattern.compile(" ?" + reFormTitle + " ",
                    Pattern.DOTALL |
                    Pattern.MULTILINE |
                    Pattern.CASE_INSENSITIVE);
        }
        return titlePats;
    }
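
The pattern builders in this listing concatenate gazetteer strings directly into regular expressions, so an entry that contains a metacharacter such as a period or a parenthesis changes the meaning of the compiled pattern. One possible alternative is to quote each entry with `Pattern.quote` before compiling. The sketch below is illustrative only: the class `LiteralPatterns` and its methods are hypothetical and are not part of the thesis code, and the surrounding-space convention of the original patterns is deliberately left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the thesis code: compiles one
// case-insensitive pattern per gazetteer entry, treating each entry
// as a literal string rather than as a regular expression.
public class LiteralPatterns {

    public static Pattern[] compileLiterals(String[] entries) {
        Pattern[] pats = new Pattern[entries.length];
        for (int z = 0; z < entries.length; z++) {
            // Pattern.quote escapes every metacharacter in the entry.
            pats[z] = Pattern.compile(Pattern.quote(entries[z].trim()),
                    Pattern.CASE_INSENSITIVE);
        }
        return pats;
    }

    // True when the literal entry occurs anywhere in the sentence.
    public static boolean matches(Pattern p, String sentence) {
        Matcher m = p.matcher(sentence);
        return m.find();
    }

    public static void main(String[] args) {
        Pattern[] pats = compileLiterals(new String[] {"pace-bowler", "U.S. Senator"});
        System.out.println(matches(pats[0], "The pace-bowler retired in 2006."));
        System.out.println(matches(pats[1], "She is a U.S. senator from Ohio."));
        // The periods are literal, so "UXSX" does not match "U.S.".
        System.out.println(matches(pats[1], "She is a UXSX senator from Ohio."));
    }
}
```

With quoting, the manual dash escaping in `initTitlePatterns` would no longer be needed, at the cost of losing the ability to put deliberate regular-expression syntax in the title, location, and organization files.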
    /**
     * @name IsolateEntities
     * @purpose Reads string (content of file) and extracts contents of next
     *          Entity by reading everything between Name: and the next Name:
     *
     * @param String fileContent -- Content of the input file.
     * @return Returns an array of entity names and the sentences for each
     *         entity.
     */
    public static String[] IsolateEntities(String fileContent) {
        // Will give you all the content between the colon in the first
        // Name:
        // Until it reaches the next Name:
        // SHOULD correspond to the number of entities collected in file.
        String[] entities = fileContent.split("Name:");
        return entities;
    }

    /**
     * @name collectName
     * @purpose Reads strings (sentences from file) and extracts names of
     *          Entity by reading everything between given first name and
     *          last name.
     *
     * @param String sentence -- Sentence from file.
     *        String firstName, String lastName -- Name parts to look for.
     *        Person p -- Person object for current person.
     * @return Function does not return anything. It does, however,
     *         output to file specified in main.
     */
    public static void collectName(String sentence, String firstName, String lastName, Person p) {
        String regex = null;
        if (lastName != null) {
            regex = "(" + firstName + ")" + ".*?" + "(" + lastName + ")";
        }
        else
            regex = "(" + firstName + ")";

        // Compile a regular expression to look for name (with anything in
        // between first and last)
        Pattern fullName = Pattern.compile(regex, Pattern.CASE_INSENSITIVE |
                Pattern.DOTALL |
                Pattern.MULTILINE);

        Matcher match = fullName.matcher(sentence);

        while (match.find()) {
            // Store current string found and remove any excess
            // characters
            String current = match.group();
            current = current.replaceAll("\\)|\\]|^ ?[^\\p{Alpha}\\p{Space}]*", "");
            if (current.length() > 30) {
                String newReg = "[^" + firstName + "]" + firstName +
                        ".*?" + lastName;
                Pattern preciseName = Pattern.compile(newReg,
                        Pattern.CASE_INSENSITIVE |
                        Pattern.DOTALL |
                        Pattern.MULTILINE);
                Matcher preciseMatch = preciseName.matcher(current);

                while (preciseMatch.find()) {
                    String preciseCurrent = preciseMatch.group();
                    preciseCurrent =
                            preciseCurrent.replaceAll("\\)|^ ?[^\\p{Alpha}\\p{Space}]*", "");

                    // Last ditch effort to grab name
                    if (preciseCurrent.length() > 30) {
                        String simpReg = firstName + lastName;
                        Pattern simpleName = Pattern.compile(simpReg,
                                Pattern.CASE_INSENSITIVE |
                                Pattern.DOTALL |
                                Pattern.MULTILINE);
                        Matcher simpleMatch = simpleName.matcher(current);

                        while (simpleMatch.find()) {
                            String simpCurrent = simpleMatch.group();
                            simpCurrent =
                                    simpCurrent.replaceAll("^ ?[^\\p{Alpha}\\p{Space}]*", "");
                            if (simpCurrent != null) {
                                p.addName(simpCurrent);
                            }
                        }
                    }
                    else {
                        if (preciseCurrent != null) {
                            p.addName(preciseCurrent);
                        }
                    }
                }
            }
            else {
                if (current != null) {
                    p.addName(current);
                }
            }
        }
    }


    public static void decideName(Person p) {
        Iterator t = p.getNames().iterator();
        TupleArray tArray = new TupleArray(p.getNames().size());

        while (t.hasNext()) {
            tArray.addString((String) t.next());
        }
        System.out.println("<NAME>");
        System.out.println("<DECISION>" + tArray.argMax() + "</DECISION>");
        System.out.println("</NAME>");
    }
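
The `decide*` methods in this listing pass every collected candidate to a `TupleArray` and print the result of `argMax()`. `TupleArray` itself is not reproduced in this part of the appendix, so the following is only a minimal sketch of one way such a frequency vote could work. The class name `FrequencyVote`, the use of a `HashMap`, and the tie-breaking rule are assumptions made for illustration and are not taken from the thesis code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not the thesis's TupleArray class.
public class FrequencyVote {

    // Returns the candidate that is first to reach the highest count,
    // or null when the list is empty.
    public static String argMax(List<String> candidates) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        String best = null;
        int bestCount = 0;
        for (String c : candidates) {
            Integer prev = counts.get(c);
            int n = (prev == null) ? 1 : prev + 1;
            counts.put(c, n);
            // Strict comparison: a later candidate must exceed the
            // current best count to replace it.
            if (n > bestCount) {
                best = c;
                bestCount = n;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Blair", "Tony Blair", "Mr Blair", "Tony Blair");
        System.out.println("<DECISION>" + argMax(names) + "</DECISION>");
    }
}
```

A vote of this kind treats "Blair" and "Tony Blair" as unrelated strings, so the quality of the decision depends on how consistently the collection step normalizes its candidates.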


    public static void collectTitle(String sentence, Pattern currPat, Person p) {
        // Loop to match all of the patterns.
        Matcher titleMatcher = currPat.matcher(sentence);

        while (titleMatcher.find()) {
            // Store current string found and remove any excess
            // characters
            String current = titleMatcher.group();

            if (current != null) {
                current = current.replaceAll("^ ?[^\\p{Alpha}\\p{Space}]*", "");
                p.addTitle(current);
            }
        }
    }


    public static void decideTitle(Person p) {
        Iterator t = p.getTitles().iterator();
        TupleArray tArray = new TupleArray(p.getTitles().size());
        int count = 0;

        while (t.hasNext()) {
            tArray.addString((String) t.next());
            count++;
        }

        System.out.println("<TITLE>");
        System.out.println("<DECISION>" + tArray.argMax() + "</DECISION>");
        System.out.println("</TITLE>");
    }


    public static boolean collectQuote(String sentence, Person p) {
        boolean quoteFound = false;
        String quotation = null;
        String author = null;

        Pattern quotes = Pattern.compile("(\".*\")", Pattern.MULTILINE |
                Pattern.DOTALL);

        Pattern whoSaid = Pattern.compile("\\w* \\w*? ?(?=said|says|stated)",
                Pattern.DOTALL |
                Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);

        Matcher quoteMatch = quotes.matcher(sentence);
        Matcher whoSaidMatch = whoSaid.matcher(sentence);

        while (whoSaidMatch.find()) {
            author = whoSaidMatch.group();
            while (quoteMatch.find()) {
                quotation = quoteMatch.group();
                // Compare string contents rather than object references.
                if (author != null && !author.equals(" ") && quotation != null) {
                    String quoteInfo = author + " : " + quotation;
                    quoteInfo = quoteInfo.replaceAll("^ ?[^\\p{Alpha}\\p{Space}]*", "");
                    quoteFound = true;
                }
            }
        }
        return quoteFound;
    }

    public static void collectLifespanData(String sentence, Person p) {
        Pattern birthWords = Pattern.compile("\\w* \\w*? ?(?=born|birth)",
                Pattern.MULTILINE |
                Pattern.CASE_INSENSITIVE);

        Pattern deathWords = Pattern.compile("\\w* \\w*? ?(?=died|death|passed away)",
                Pattern.MULTILINE |
                Pattern.CASE_INSENSITIVE);

        Pattern years = Pattern.compile("[0-9]{4}?",
                Pattern.MULTILINE);

        Matcher matchBirth = birthWords.matcher(sentence);
        Matcher matchDeath = deathWords.matcher(sentence);
        Matcher matchYears = years.matcher(sentence);
        String tombInfo = null;

        while (matchBirth.find()) {
            tombInfo = matchBirth.group();
            if (tombInfo != null) {
                p.addTombstoneData(tombInfo);
            }
        }
        while (matchDeath.find()) {
            tombInfo = matchDeath.group();
            if (tombInfo != null) {
                p.addTombstoneData(tombInfo);
            }
        }
        // Any sentence containing a four-digit number is kept whole.
        while (matchYears.find()) {
            if (matchYears.group() != null) {
                p.addTombstoneData(sentence);
            }
        }
    }
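
The year pattern used in `collectLifespanData`, `[0-9]{4}?`, matches any four consecutive digits, including four digits taken from a longer number, and the sentence is stored once per match. As a hedged illustration of a stricter variant, the sketch below extracts only standalone four-digit values and returns each as an integer. The class `YearFinder`, the method `findYears`, and the accepted range of 1000 to 2099 are assumptions chosen for the example and do not come from the thesis code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of the thesis code.
public class YearFinder {

    // Matches standalone four-digit numbers from 1000 to 2099.
    // The \b anchors reject digits embedded in a longer number.
    private static final Pattern YEAR =
            Pattern.compile("\\b(1[0-9]{3}|20[0-9]{2})\\b");

    // Returns every year found in the sentence, in order of appearance.
    public static List<Integer> findYears(String sentence) {
        List<Integer> years = new ArrayList<Integer>();
        Matcher m = YEAR.matcher(sentence);
        while (m.find()) {
            years.add(Integer.valueOf(m.group(1)));
        }
        return years;
    }

    public static void main(String[] args) {
        System.out.println(findYears("He was born in 1953 and died in 2007."));
        System.out.println(findYears("Serial number 123456 was issued."));
    }
}
```

Returning the years rather than the whole sentence would let a caller store each sentence once and decide separately which year belongs to a birth or a death.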


    public static void collectFamily(String sentence, Person p) {
        Pattern familyWords = Pattern.compile("(family )| (father )| (grandfather )| (grandmother )|" +
                "(in(-| )law)| (mother )| (sister )| (brother )",
                Pattern.CASE_INSENSITIVE |
                Pattern.MULTILINE);

        Matcher matchFam = familyWords.matcher(sentence);

        while (matchFam.find()) {
            p.addFamily(sentence);
        }
    }

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public static void collectFriends (String sentence, Person p) { 


        Pattern meetTerms = Pattern.compile("meeting |meet |summit |met |gather " +
                "|assemble |met with |gathering " +
                "|assembled ",
                Pattern.CASE_INSENSITIVE |
                Pattern.MULTILINE);

        Matcher matchMeet = meetTerms.matcher(sentence);
        while (matchMeet.find()) {
            p.addFriends(sentence);
        }
    }





    private static void collectCountry(String sentence, Pattern currLocPat, Person p) {
        Matcher locMatcher = currLocPat.matcher(sentence);
        while (locMatcher.find()) {
            p.addCountry(locMatcher.group());
        }
    }

    public static void decideCountry(Person p) {
        Iterator t = p.getCountry().iterator();
        TupleArray tArray = new TupleArray(p.getCountry().size());
        int count = 0;

        while (t.hasNext()) {
            tArray.addString((String) t.next());
            count++;
        }

        System.out.println("<COUNTRY>");
        System.out.println("<DECISION>" + tArray.argMax() + "</DECISION>");
        System.out.println("</COUNTRY>");
    }





    private static void collectCity(String sentence, Pattern currLocPat, Person p) {
        Matcher locMatcher = currLocPat.matcher(sentence);
        while (locMatcher.find()) {
            p.addCity(locMatcher.group());
        }
    }

    private static void decideCity(Person p) {
        Iterator t = p.getCity().iterator();
        TupleArray tArray = new TupleArray(p.getCity().size());
        int count = 0;

        while (t.hasNext()) {
            tArray.addString((String) t.next());
            count++;
        }

        System.out.println("<CITY>");
        System.out.println("<DECISION>" + tArray.argMax() + "</DECISION>");
        System.out.println("</CITY>");
    }

    public static void collectOrganizations(String sentence, Pattern currOrgPat, Person p) {
        Matcher orgMatcher = currOrgPat.matcher(sentence);

        while (orgMatcher.find()) {
            p.addOrganization(orgMatcher.group());
        }
    }

    public static void decideOrganization(Person p) {
        Iterator t = p.getOrganizations().iterator();
        TupleArray tArray = new TupleArray(p.getOrganizations().size());
        int count = 0;

        while (t.hasNext()) {
            tArray.addString((String) t.next());
            count++;
        }

        System.out.println("<ORGANIZATION>");
        System.out.println("<DECISION>" + tArray.argMax() + "</DECISION>");
        System.out.println("</ORGANIZATION>");
    }

    /**
     * @name main
     * @param args
     * @return Returns nothing.
     */
    public static void main(String[] args) {


        // Initialize Output File
        fh.init("BioData_081307.xml");
        System.out.println("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>");

        // Read Input File and split into entity chunks
String content = fh.openFileChannel ("MORETHAN1O0O_ORIG.txt"); 
        String[] entities = IsolateEntities(content);
        int numEntities = entities.length;
        people = new Person[numEntities];

        // Open file that contains titles from corpus
        String titleFile = fh.openFileChannel("Titles.txt");
        String[] titles = titleFile.split("\r\n");
        int numTitles = titles.length;

        // Open file that contains locations from corpus
        String locFile = fh.openFileChannel("Locations.txt");
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String[] locales = locFile.split("\r\n"); 
int numLocales = locales.length; 


        // Open file that contains organizations from corpus
        String orgFile = fh.openFileChannel("Organizations.txt");
        String[] orgs = orgFile.split("\r\n");
        int numOrgs = orgs.length;

        // Init Location, Title, and Organization patterns
        initTitlePatterns(titles);
        initCountryPatterns(locales);
        initCityPatterns(locales);
        initOrganPatterns(orgs);

        // Store each line of the entity's content for processing.
        System.out.println("<BIOGRAPHIES>");

        for (int p = 1; p < numEntities; p++) {





            System.out.println("<PERSON ID=\"" + p + "\">");
            System.err.println("Now serving number: " + p);
            people[p] = new Person();
            String[] entContent = entities[p].split("\n");

            // Determine name of current entity.
            // Every name starts with a blank space, so start at 1
            String name = entContent[0].substring(1,
                    entContent[0].length() - 3);
            String firstName = null;
            String lastName = null;
            int numSentences = 0;
            if (name.contains(" ")) {
                firstName = name.substring(0, name.indexOf(" "));
                lastName = name.substring(name.indexOf(" ") + 1,
                        name.length());
                numSentences = entContent.length;
            }
            else {
                firstName = name.substring(0, name.length() - 3);
                numSentences = entContent.length;
                System.err.println(firstName);
            }

            for (int s = 1; s < numSentences; s++) {
                String currentSentence = entContent[s];
                if (currentSentence.contains("Now looking at reference number ") ||
                        currentSentence.contains("Information from corpus") ||
                        currentSentence.length() == 0) {
                    continue;
                }

                // Collect quotes: if one is found, then the current
                // sentence is a quote and we should skip the rest of
                // the loop.
                if (collectQuote(currentSentence, people[p])) {
                    continue;
                }
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// Collect name 
                collectName(currentSentence, firstName, lastName,
                        people[p]);

                // Collect title
                for (int t = 0; t < numTitles; t++) {
                    collectTitle(currentSentence, titlePats[t],
                            people[p]);
                }

                // Collect tombstone information
                collectLifespanData(currentSentence, people[p]);

                // Collect family information
                collectFamily(currentSentence, people[p]);

                // Collect friend information
                collectFriends(currentSentence, people[p]);

                // Collect country information
                for (int l = 0; l < numLocales; l++) {
                    collectCountry(currentSentence, contPats[l],
                            people[p]);
                }

                // Collect city information
                for (int m = 0; m < numLocales; m++) {
                    collectCity(currentSentence, cityPats[m],
                            people[p]);
                }

                // Collect organization information
                for (int n = 0; n < numOrgs; n++) {
                    collectOrganizations(currentSentence,
                            orgPats[n], people[p]);
                }
            }

            decideName(people[p]);
            decideTitle(people[p]);
            decideCountry(people[p]);
            decideCity(people[p]);
            decideOrganization(people[p]);

            System.out.println("<FAMILY>");
            Iterator k = people[p].getFamily().iterator();
            String[] famArray = new
                    String[people[p].getFamily().size()];
            int c = 0;
            while (k.hasNext()) {
                famArray[c] = (String) k.next();
                c++;
            }

            String[] redundFreeFam =
                    people[p].removeRedundancy(famArray);

            for (int h = 0; h < redundFreeFam.length; h++) {
                System.out.println("<F" + h + ">" + redundFreeFam[h]
                        + "</F" + h + ">");
            }
            System.err.println("Number of family inclusions: " +
                    redundFreeFam.length);
            System.out.println("</FAMILY>");





            System.out.println("<ASSOCIATES>");
            Iterator l = people[p].getFriends().iterator();
            String[] assocArray = new
                    String[people[p].getFriends().size()];
            int d = 0;
            while (l.hasNext()) {
                assocArray[d] = (String) l.next();
                d++;
            }

            String[] redundFreeAssoc =
                    people[p].removeRedundancy(assocArray);

            for (int h = 0; h < redundFreeAssoc.length; h++) {
                System.out.println("<R" + h + ">" +
                        redundFreeAssoc[h] + "</R" + h + ">");
            }
            System.err.println("Number of associate inclusions: " +
                    redundFreeAssoc.length);
            System.out.println("</ASSOCIATES>");
            System.out.println("</PERSON>");
        }

        System.out.println("</BIOGRAPHIES>");

        try {
            fh.close();
        }
        catch (Exception e) {
            e.printStackTrace();
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